Preface to the third edition


In the nine years since the manuscript for the second edition of The 
Econometric Modelling of Financial Time Series was completed there have 
continued to be many advances in time series econometrics, some of which 
have been in direct response to features found in the data coming from 
financial markets, while others have found ready application in financial 
fields. Incorporating these developments was too much for a single author, 
particularly one whose interests have diverged from financial econometrics 
quite significantly in the intervening years! Raphael Markellos has thus 
become joint author, and his interests and expertise in finance now 
permeate throughout this new edition, which has had to be lengthened 
somewhat to accommodate many new developments in the area. 

Chapters 1 and 2 remain essentially the same as in the second edition, 
although examples have been updated. The material on unit roots and 
associated techniques has continued to expand, so much so that it now has 
an entire chapter, 3, devoted to it. The remaining material on univariate 
linear stochastic models now comprises chapter 4, with much more on 
fractionally differenced processes being included in response to developments
in recent years. Evidence of non-linearity in financial time series has
continued to accumulate, and stochastic variance models and the many 
extensions of the ARCH process continue to be very popular, along with the 
related area of modelling volatility. This material now forms chapter 5, with 
further non-linear models and tests of non-linearity providing the material 
for chapter 6. Chapter 7 now contains the material on modelling return 
distributions and transformations of returns. Much of the material of 
chapters 8, 9 and 10 (previously chapters 6, 7 and 8) remains as before, but 
with expanded sections on, for example, non-linear generalisations of 
cointegration. 


Introduction 


The aim of this book is to provide the researcher in financial markets with the 
techniques necessary to undertake the empirical analysis of financial time series. 
To accomplish this aim we introduce and develop both univariate modelling 
techniques and multivariate methods, including those regression techniques 
for time series that seem to be particularly relevant to the finance area. 

Why do we concentrate exclusively on time series techniques when, for 
example, cross-sectional modelling plays an important role in empirical 
investigations of the capital asset pricing model (CAPM; see, as an early and 
influential example, Fama and MacBeth, 1973)? Moreover, why do we not 
address the many issues involved in modelling financial time series in continuous
time and the spectral domain, although these approaches have
become very popular, for example, in the context of derivative asset pricing? 
Our answer is that, apart from the usual considerations of personal expertise 
and interest plus constraints on manuscript length, it is because time series 
analysis, in both its theoretical and empirical aspects, has been for many 
years an integral part of the study of financial markets. 

The first attempts to study the behaviour of financial time series were 
undertaken by financial professionals and journalists rather than by academics.
Indeed, this seems to have become a long-standing tradition, as, even
today, much empirical research and development still originates from the 
financial industry itself. This can be explained by the practical nature of the 
problems, the need for specialised data and the potential gains from such 
analysis. The earliest and best-known example of published research on 
financial time series is by the legendary Charles Dow, as expressed in his 
editorials in the Wall Street Journal between 1900 and 1902. These writings
formed the basis of ‘Dow theory’ and influenced what later became known as
technical analysis and chartism. Although Dow did not collect and publish 
his editorials separately, this was done posthumously by his follower Samuel 
Nelson (Nelson, 1902). Dow’s original ideas were later interpreted and 
further extended by Hamilton (1922) and Rhea (1932). These ideas enjoyed 
some recognition amongst academics at the time: for example, Hamilton was
elected a fellow of the Royal Statistical Society. As characteristically treated by 
Malkiel (2003), however, technical analysis and chartist approaches became 
anathema to academics, despite their widespread popularity amongst 
financial professionals. Although Dow and his followers discussed many of 
the ideas we encounter in modern finance and time series analysis, including 
stationarity, market efficiency, correlation between asset returns and indices, 
diversification and unpredictability, they made no serious effort to adopt 
formal statistical methods. Most of the empirical analysis involved the 
painstaking interpretation of detailed charts of sectoral stock price averages, 
thus forming the celebrated Dow-Jones indices. It was argued that these 
indices discount all necessary information and provide the best predictor of 
future events. A fundamental idea, very relevant to the theory of cycles by 
Stanley Jevons and the ‘Harvard A-B-C curve’ methodology of trend decomposition
by Warren Persons, was that market price variations consisted of three
primary movements: daily, medium-term and long-term (see Samuelson, 
1987). Although criticism of Dow theory and technical analysis has been a 
favourite pastime of academics for many years, evidence regarding its merit 
remains controversial (see, for example, Brown, Goetzmann and Kumar, 1998). 

The earliest empirical research using formal statistical methods can be 
traced back to the papers by Working (1934), Cowles (1933, 1944) and 
Cowles and Jones (1937). Working focused attention on a previously noted 
characteristic of commodity and stock prices: namely, that they resemble 
cumulations of purely random changes. Alfred Cowles 3rd, a quantitatively 
trained financial analyst and founder of the Econometric Society and the 
Cowles Foundation, investigated the ability of market analysts and financial 
services to predict future price changes, finding that there was little evidence 
that they could. Cowles and Jones reported evidence of positive correlation 
between successive price changes, but, as Cowles (1960) was later to remark, 
this was probably due to their taking monthly averages of daily or weekly 
prices before computing changes: a ‘spurious correlation’ phenomenon, 
analysed by Working (1960). 

The predictability of price changes has since become a major theme of 
financial research but, surprisingly, little more was published until Kendall’s 
(1953) study, in which he found that the weekly changes in a wide variety of 
financial prices could not be predicted either from past changes in the series 
or from past changes in other price series. This seems to have been the first 
explicit reporting of this oft-quoted property of financial prices, although 
further impetus to research on price predictability was provided only by the 
publication of the papers by Roberts (1959) and Osborne (1959). The former
presents a largely heuristic argument as to why successive price changes
should be independent, while the latter develops the proposition that it is not
absolute price changes but the logarithmic price changes that are independent
of each other. With the auxiliary assumption that the changes themselves
are normally distributed, this implies that prices are generated as
Brownian motion. 

The stimulation provided by these papers was such that numerous articles 
appeared over the next few years investigating the hypothesis that price 
changes (or logarithmic price changes) are independent, a hypothesis that 
came to be termed the ‘random walk’ model, in recognition of the similarity 
of the evolution of a price series to the random stagger of a drunk. Indeed, 
the term ‘random walk’ is believed to have first been used in an exchange of 
correspondence appearing in Nature in 1905 (see Pearson and Rayleigh, 
1905), which was concerned with the optimal search strategy for finding a 
drunk who had been left in the middle of a field at the dead of night! The 
solution is to start exactly where the drunk had been placed, as that point is 
an unbiased estimate of the drunk’s future position since he will presumably 
stagger along in an unpredictable and random fashion. 

The most natural way to state formally the random walk model is as 


$$P_t = P_{t-1} + a_t \qquad (1.1)$$

where $P_t$ is the price observed at the beginning of time $t$ and $a_t$ is an error
term which has zero mean and whose values are independent of each other.
The price change, $\Delta P_t = P_t - P_{t-1}$, is thus simply $a_t$ and hence is independent
of past price changes. Note that, by successive backward substitution in (1.1),
we can write the current price as the cumulation of all past errors, i.e.

$$P_t = \sum_{i=1}^{t} a_i$$

so that the random walk model implies that prices are indeed generated by
Working’s ‘cumulation of purely random changes’. Osborne’s model of
Brownian motion implies that equation (1.1) holds for the logarithms of $P_t$
and, further, that $a_t$ is drawn from a zero mean normal distribution having
constant variance.
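
As an illustration only, the cumulation in (1.1) can be sketched in a few lines of Python. The function name, the seed, the unit error variance and the starting value $P_0 = 0$ are assumptions made for this sketch, and the normal errors correspond to Osborne's auxiliary assumption rather than to a requirement of (1.1) itself.

```python
import random
from itertools import accumulate

def simulate_random_walk(n, sigma=1.0, seed=0):
    """Return (errors, prices) for P_t = P_{t-1} + a_t with P_0 = 0 (assumed)."""
    rng = random.Random(seed)
    # a_t: independent, zero-mean errors; normality is an illustrative choice
    errors = [rng.gauss(0.0, sigma) for _ in range(n)]
    # P_t is the running sum a_1 + ... + a_t
    prices = list(accumulate(errors))
    return errors, prices

errors, prices = simulate_random_walk(500)
# First differences of the prices recover the errors (up to rounding)
changes = [prices[0]] + [prices[t] - prices[t - 1] for t in range(1, len(prices))]
```

Differencing the simulated prices returns the error sequence, which is the sense in which the price change is 'simply $a_t$'.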

Most of the early papers in this area are contained in the collection of 
Cootner (1964), while Granger and Morgenstern (1970) provide a detailed 
development and empirical examination of the random walk model and 
various of its refinements. Amazingly, much of this work had been anticipated 
by the French mathematician Louis Bachelier (1900; English translation in
Cootner, 1964) in a remarkable PhD thesis in which he developed an elaborate
mathematical theory of speculative prices, which he then tested on the
pricing of French government bonds, finding that such prices were consistent
with the random walk model. What made the thesis even more
remarkable was that it also developed many of the mathematical properties 
of Brownian motion that had been thought to have first been derived some 
years later in the physical sciences, particularly by Einstein! Yet, as 
Mandelbrot (1989) remarks, Bachelier had great difficulty in even getting 
himself a university appointment, let alone getting his theories disseminated 
throughout the academic community! The importance and influence of 
Bachelier’s path-breaking work is discussed in Sullivan and Weithers (1991) 
and Dimand (1993). 

It should be emphasised that the random walk model is only a hypothesis 
about how financial prices move. One way in which it can be tested is by 
examining the autocorrelation properties of price changes: see, for example, 
Fama (1965). A more general perspective is to view (1.1) as a particular 
model within the class of autoregressive integrated moving average (ARIMA) 
models popularised by Box and Jenkins (1976). Chapter 2 thus develops the 
theory of such models within the general context of (univariate) linear 
stochastic processes. An important aspect of specifying ARIMA models is to 
be able to determine correctly the order of integration of the series being 
analysed and, associated with this, the appropriate way of modelling trends 
and structural breaks. To do this formally requires the application of unit 
root tests and a vast range of related procedures. Tests for unit roots and 
alternative trend specifications are the focus of chapter 3. 

We should avoid giving the impression that the only financial time series 
of interest are stock prices. There are financial markets other than those for 
stocks, most notably for bonds and foreign currency, but there also exist the 
various futures, commodity and derivative markets, all of which provide 
interesting and important series to analyse. For certain of these, it is by no 
means implausible that models other than the random walk may be appropriate,
or, indeed, models from a class other than the ARIMA. Chapter 4
therefore discusses various topics in the general analysis of linear stochastic 
models: for example, methods of decomposing an observed series into two 
or more unobserved components and of determining the extent of the 
‘memory’ of a series, by which is meant the behaviour of the series at low 
frequencies or, equivalently, in the very long run. A variety of examples taken 
from the financial literature are provided throughout these chapters. 

The random walk model has been the workhorse of empirical finance for
many years, mainly because of its simplicity and mathematical tractability. 
Its prominent role was also supported by theoretical models that obtained 
unpredictability as a direct implication of market efficiency, or, more 
broadly speaking, of the condition whereby market prices fully, correctly and 
instantaneously reflect all the available information. An evolving discussion 
of this research can be found in a series of papers by Fama (1970, 1991, 
1998), while Timmermann and Granger (2004) address market efficiency 
from a forecasting perspective. As LeRoy (1989) discusses, it was later shown 
that the random walk behaviour of financial prices is neither a sufficient nor 
a necessary condition for rationally determined financial prices. Moreover, 
the assumption in (1.1) that price changes are independent was found to be 
too restrictive to be generated within a reasonably broad class of optimising 
models. A model that is appropriate, however, can be derived for stock prices 
in the following way (similar models can be derived for other sorts of 
financial prices, although the justification is sometimes different: see LeRoy, 
1982). The return on a stock from $t$ to $t+1$ is defined as the sum of the
dividend yield and the capital gain, i.e. as

$$r_{t+1} = \frac{P_{t+1} + D_t - P_t}{P_t} \qquad (1.2)$$

where $D_t$ is the dividend paid during period $t$. Let us suppose that the
expected return is constant, $E_t(r_{t+1}) = r$, where $E_t(\cdot)$ is the expectation
conditional on information available at $t$: $r_t$ is then said to be a fair game.
Taking expectations at $t$ of both sides of (1.2) and rearranging yields

$$P_t = (1 + r)^{-1} E_t(P_{t+1} + D_t) \qquad (1.3)$$


which says that the stock price at the beginning of period t equals the sum of 
the expected future price and dividend, discounted back at the rate r. Now 
assume that there is a mutual fund that holds the stock in question and that it 
reinvests dividends in future share purchases. Suppose that it holds $h_t$ shares
at the beginning of period $t$, so that the value of the fund is $x_t = h_t P_t$. The
assumption that the fund ploughs back its dividend income implies that
$h_{t+1}$ satisfies

$$h_{t+1} P_{t+1} = h_t (P_{t+1} + D_t)$$

Thus

$$E_t(x_{t+1}) = E_t(h_{t+1} P_{t+1}) = h_t E_t(P_{t+1} + D_t) = (1 + r) h_t P_t = (1 + r) x_t$$
i.e. $x_t$ is a martingale (if, as is common, $r > 0$, we have $E_t(x_{t+1}) > x_t$, so that $x_t$
is a submartingale; LeRoy (1989, pp. 1593-4) offers an example, however, in
which $r$ could be negative, in which case $x_t$ will be a supermartingale). LeRoy
(1989) emphasises that price itself, without dividends added in, is not generally
a martingale, since from (1.3) we have

$$r = E_t(D_t)/P_t + E_t(P_{t+1})/P_t - 1$$

so that only if the expected dividend/price ratio (or dividend yield) is constant,
say $E_t(D_t)/P_t = d$, can we write $P_t$ as the submartingale (assuming $r > d$)

$$E_t(P_{t+1}) = (1 + r - d) P_t$$


The assumption that a stochastic process, $y_t$ say, follows a random walk
is more restrictive than the requirement that $y_t$ follows a martingale. The
martingale rules out any dependence of the conditional expectation of
$\Delta y_{t+1}$ on the information available at $t$, whereas the random walk rules out
not only this but also dependence involving the higher conditional moments
of $\Delta y_{t+1}$. The importance of this distinction is thus evident: financial series
are known to go through protracted quiet periods and also protracted periods
of turbulence. This type of behaviour could be modelled by a process in
which successive conditional variances of $\Delta y_{t+1}$ (but not successive levels)
are positively autocorrelated. Such a specification would be consistent with a
martingale, but not with the more restrictive random walk.
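
The distinction can be made concrete with a small simulation. The sketch below is a hedged illustration, not a model taken from the text at this point: it assumes changes of the form $\Delta y_t = \sqrt{h_t}\, z_t$ with $h_t = \omega + \alpha (\Delta y_{t-1})^2$ and standard normal $z_t$, and the parameter values, seed and function names are hypothetical. The changes have zero conditional mean, so the levels form a martingale, yet the squared changes are autocorrelated, which the random walk rules out.

```python
import math
import random

def simulate_changes(n, omega=0.7, alpha=0.3, seed=1):
    """Changes dy_t = sqrt(h_t) * z_t with h_t = omega + alpha * dy_{t-1}^2."""
    rng = random.Random(seed)
    changes, prev = [], 0.0
    for _ in range(n):
        h = omega + alpha * prev ** 2              # conditional variance given the past
        prev = math.sqrt(h) * rng.gauss(0.0, 1.0)  # conditional mean is zero
        changes.append(prev)
    return changes

def lag1_autocorr(x):
    """Sample lag-1 autocorrelation of a sequence."""
    n = len(x)
    mean = sum(x) / n
    num = sum((x[t] - mean) * (x[t - 1] - mean) for t in range(1, n))
    den = sum((v - mean) ** 2 for v in x)
    return num / den

dy = simulate_changes(20000)
r_changes = lag1_autocorr(dy)                   # expected to be near zero
r_squares = lag1_autocorr([v * v for v in dy])  # expected to be clearly positive
```

Under these assumptions the first statistic should be close to zero and the second noticeably positive, reflecting quiet and turbulent spells in the simulated series.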

Martingale processes are discussed in chapter 5, and lead naturally on to 
non-linear stochastic processes that are capable of modelling higher conditional
moments, such as the autoregressive conditionally heteroskedastic
(ARCH) model introduced by Engle (1982) and stochastic variance models. 
Related to these models is the whole question of how to model volatility 
itself, which is of fundamental concern to financial modellers and is therefore 
also analysed in this chapter. Of course, once we entertain the possibility of 
non-linear generating processes a vast range of possible processes become 
available, and those that have found, at least potential, use in modelling 
financial time series are developed in chapter 6. These include bilinear 
models, Markov switching processes, smooth transitions and chaotic models.
The chapter also includes a discussion of computer-intensive techniques
such as non-parametric modelling and artificial neural networks. An
important aspect of non-linear modelling is to be able to test for non-linear
behaviour, and testing procedures thus provide a key section of this chapter.

The focus of chapter 7 is on the unconditional distributions of asset returns.
The most noticeable feature of such distributions is their leptokurtic property:
they have fat tails and high peakedness compared to a normal distribution. 
Although ARCH processes can model such features, much attention in the 
finance literature since Mandelbrot’s (1963a, 1963b) path-breaking papers has 
concentrated on the possibility that returns are generated by a stable process, 
which has the property of having an infinite variance. Recent developments in 
statistical analysis have allowed a much deeper investigation of the tail shapes 
of empirical distributions, and methods of estimating tail shape indices are 
introduced and applied to a variety of returns series. The chapter then looks at 
the implications of fat-tailed distributions for testing the covariance stationarity
assumption of time series analysis, data analytic methods of modelling
skewness and kurtosis, and the impact of analysing transformations of 
returns rather than the returns themselves. 

The remaining three chapters focus on multivariate techniques of time 
series analysis, including regression methods. Chapter 8 concentrates on 
analysing the relationships between a set of stationary — or, more precisely, 
non-integrated — financial time series and considers such topics as general 
dynamic regression, robust estimation, generalised methods of moments, 
multivariate regression, ARCH-in-mean and multivariate ARCH models, 
vector autoregressions, Granger causality, variance decompositions and 
impulse response analysis. These topics are illustrated with a variety of examples
drawn from the finance literature: using forward exchange rates as optimal
predictors of future spot rates; modelling the volatility of stock returns and the
risk premium in the foreign exchange market; testing the CAPM; and investigating
the interaction of the equity and gilt markets in the United Kingdom.

Chapter 9 concentrates on the modelling of integrated financial time 
series, beginning with a discussion of the spurious regression problem, 
introducing cointegrated processes and demonstrating how to test for 
cointegration, and then moving on to consider how such processes can be 
estimated. Vector error correction models are analysed in detail, along with 
associated issues in causality testing and impulse response analysis, alternative
approaches to testing for the presence of a long-run relationship, and
the analysis of both common cycles and trends. The techniques introduced 
in this chapter are illustrated with extended examples analysing the market 
model and the interactions of the UK financial markets. 

Finally, chapter 10 considers modelling issues explicit to finance. 
Samuelson (1965, 1973) and Mandelbrot (1966) have analysed the implications
of equation (1.3), that the stock price at the beginning of time $t$
equals the discounted sum of the next period’s expected future price and
dividend, to show that this stock price equals the expected discounted, or
present, value of all future dividends, i.e. that

$$P_t = \sum_{i=0}^{\infty} (1 + r)^{-(i+1)} E_t(D_{t+i}) \qquad (1.4)$$

which is obtained by recursively solving (1.3) forwards and assuming that
$(1 + r)^{-n} E_t(P_{t+n})$ converges to zero as $n \to \infty$. Present value models of the
type (1.4) are analysed comprehensively in chapter 10, with the theme of
whether stock markets are excessively volatile, perhaps containing speculative
bubbles, being used extensively throughout the discussion and in a
succession of examples, although the testing of the expectations hypothesis
of the term structure of interest rates is also used as an example of the general
present value framework. The chapter also discusses recent research on
non-linear generalisations of cointegration and how structural breaks may be
dealt with in cointegrating relationships.
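
The link between (1.3) and (1.4) can be checked numerically. The following Python sketch is illustrative only: the flat expected dividend stream, the discount rate, the finite horizon and the function names are all assumptions. It sets the terminal price to zero, which mirrors the convergence condition on $(1 + r)^{-n} E_t(P_{t+n})$, and compares the discounted sum in (1.4) with the price obtained by applying (1.3) repeatedly from the horizon back to $t$.

```python
def present_value(expected_dividends, r):
    """Truncated form of (1.4): sum over i of (1 + r)^-(i+1) * E_t(D_{t+i})."""
    return sum(d / (1.0 + r) ** (i + 1) for i, d in enumerate(expected_dividends))

def price_by_recursion(expected_dividends, r, terminal_price=0.0):
    """Apply (1.3), P = (E(P_next) + E(D)) / (1 + r), backwards from the horizon."""
    price = terminal_price
    for d in reversed(expected_dividends):
        price = (price + d) / (1.0 + r)
    return price

r = 0.05
dividends = [2.0] * 2000   # hypothetical flat expected dividend stream
pv = present_value(dividends, r)
recursive = price_by_recursion(dividends, r)
```

With a constant expected dividend $D$ the infinite sum in (1.4) reduces to $D/r$, so both numbers here should be very close to $2/0.05 = 40$.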

Having emphasised earlier in this chapter that the book is exclusively 
about modelling financial time series, we should state at this juncture what 
the book is not about. It is certainly not a text on financial market theory, 
and any such theory is discussed only when it is necessary as a motivation for 
a particular technique or example. There are numerous texts on the theory of 
finance, and the reader is referred to these for the requisite financial theory: 
two notable texts that contain both theory and empirical techniques are 
Campbell, Lo and MacKinlay (1997) and Cuthbertson (1996). Neither is it a 
textbook on econometrics. We assume that the reader already has a working 
knowledge of probability, statistics and econometric theory, in particular 
least squares estimation. Nevertheless, it is also non-rigorous, being at a level 
roughly similar to Mills (1990), in which references to the formal treatment 
of the theory of time series are provided. 

When the data used in the examples throughout the book have already 
been published, references are given. Previously unpublished data are defined
in the data appendix, which contains details on how they may be accessed. 
All standard regression computations were carried out using EVIEWS 5.0 
(EViews, 2003), but use was also made of STAMP 5.0 (Koopman et al., 2006), 
TSM 4.18 (Davidson, 2006a) and occasionally other econometric packages. 
‘Non-standard’ computations were made using algorithms written by the 
authors in GAUSS and MatLab. 


2 Univariate linear stochastic models: basic concepts


Chapter 1 has emphasised the standard representation of a financial time 
series as that of a (univariate) linear stochastic process, specifically as being a 
member of the class of ARIMA models popularised by Box and Jenkins 
(1976). This chapter provides the basic theory of such models within the 
general framework of the analysis of linear stochastic processes. As already 
stated in chapter 1, our treatment is purposely non-rigorous. For detailed 
theoretical treatments, but which do not, however, focus on the analysis of 
financial series, see, for example, Brockwell and Davis (1996), Hamilton 
(1994), Fuller (1996) or Taniguchi and Kakizawa (2000). 


2.1 Stochastic processes, ergodicity and stationarity 


2.1.1 Stochastic processes, realisations and ergodicity 


When we wish to analyse a financial time series using formal statistical 
methods, it is useful to regard the observed series, $(x_1, x_2, \ldots, x_T)$, as a particular
realisation of a stochastic process. This realisation is often denoted
$\{x_t\}_1^T$, while, in general, the stochastic process itself will be the family of
random variables $\{X_t\}_{-\infty}^{\infty}$ defined on an appropriate probability space. For
our purposes it will usually be sufficient to restrict the index set $\mathcal{T} = (-\infty, \infty)$
of the parent stochastic process to be the same as that of the realisation,
i.e. $\mathcal{T} = (1, T)$, and also to use $x_t$ to denote both the stochastic process and the
realisation when there is no possibility of confusion.

With these conventions, the stochastic process can be described by a 
T-dimensional probability distribution, so that the relationship between a 
realisation and a stochastic process is analogous to that between the sample 
and population in classical statistics. Specifying the complete form of the 
probability distribution will generally be too ambitious a task, and we usually 
content ourselves with concentrating attention on the first and second
moments: the $T$ means

$$E(x_1), E(x_2), \ldots, E(x_T)$$

$T$ variances

$$V(x_1), V(x_2), \ldots, V(x_T)$$

and $T(T-1)/2$ covariances

$$\mathrm{Cov}(x_i, x_j), \quad i < j$$


If we could assume joint normality of the distribution, this set of 
expectations would then completely characterise the properties of the stochastic
process. As we shall see, however, such an assumption is unlikely to
be appropriate for many financial series. If normality cannot be assumed but 
the process is taken to be linear, in the sense that the current value of the 
process is generated by a linear combination of previous values of the process 
itself and current and past values of any other related processes, then, again, 
this set of expectations would capture its major properties. In either case, 
however, it will be impossible to infer all the values of the first and second 
moments from just one realisation of the process, since there are only T 
observations but $T + T(T+1)/2$ unknown parameters. Hence, further simplifying
assumptions must be made to reduce the number of unknown
parameters to more manageable proportions. 

We should emphasise that the procedure of using a single realisation to 
infer the unknown parameters of a joint probability distribution is valid only 
if the process is ergodic, which essentially means that the sample moments for 
finite stretches of the realisation approach their population counterparts as 
the length of the realisation becomes infinite. For more on ergodicity, see, for 
example, Granger and Newbold (1986, chap. 1) or Hamilton (1994, chap. 3.2) 
and, since it is difficult to test for ergodicity using just (part of) a single 
realisation, it will be assumed from now on that all time series have this 
property. Domowitz and El-Gamal (2001) have provided a set of sufficient 
assumptions under which a single time series trajectory will contain enough 
information to construct a consistent non-parametric test of ergodicity. 


2.1.2 Stationarity 


One important simplifying assumption is that of stationarity, which requires 
the process to be in a particular state of ‘statistical equilibrium’ (Box and 
Jenkins, 1976, p. 26). A stochastic process is said to be strictly stationary if its
properties are unaffected by a change of time origin. In other words, the joint
probability distribution at any set of times $t_1, t_2, \ldots, t_m$ must be the same as
the joint probability distribution at times $t_1 + k, t_2 + k, \ldots, t_m + k$, where $k$ is
an arbitrary shift in time. For $m = 1$, this implies that the marginal probability
distributions do not depend on time, which in turn implies that, so
long as $E|x_t|^2 < \infty$, both the mean and variance of $x_t$ must be constant, i.e.

$$E(x_1) = E(x_2) = \cdots = E(x_T) = E(x_t) = \mu$$

and

$$V(x_1) = V(x_2) = \cdots = V(x_T) = V(x_t) = \sigma_x^2$$


If m=2, strict stationarity implies that all bivariate distributions do not 
depend on t: thus all covariances are functions only of the time shift (or lag) 
k — i.e. for all k 


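$$\mathrm{Cov}(x_1, x_{1+k}) = \mathrm{Cov}(x_2, x_{2+k}) = \cdots = \mathrm{Cov}(x_{T-k}, x_T) = \mathrm{Cov}(x_t, x_{t-k})$$

Hence, we may define the autocovariances and autocorrelations as

$$\gamma_k = \mathrm{Cov}(x_t, x_{t-k}) = E[(x_t - \mu)(x_{t-k} - \mu)]$$

and

$$\rho_k = \frac{\mathrm{Cov}(x_t, x_{t-k})}{[V(x_t) V(x_{t-k})]^{1/2}} = \frac{\gamma_k}{\gamma_0}$$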

respectively, both of which depend only on the lag k. Since these conditions 
apply just to the first- and second-order moments of the process, this is 
known as second-order or weak stationarity (and sometimes covariance
stationarity or stationarity in the wide sense). While strict stationarity (with finite
second moments) thus implies weak stationarity, the converse does not hold,
for it is possible for a process to be weakly stationary but not strictly
stationary; this would be the case if higher moments, such as $E(x_t^3)$, were
functions of time. If, however, joint normality could be assumed, so that the
distribution was entirely characterised by the first two moments, weak
stationarity does indeed imply strict stationarity. More complicated
relationships between these concepts of stationarity hold for some types of
non-linear processes (as is discussed in chapter 4). 

The autocorrelations considered as a function of $k$ are referred to as the
autocorrelation function (ACF). Note that, since

$$\gamma_k = \mathrm{Cov}(x_t, x_{t-k}) = \mathrm{Cov}(x_{t-k}, x_t) = \mathrm{Cov}(x_t, x_{t+k}) = \gamma_{-k}$$

it follows that $\rho_k = \rho_{-k}$, and so only the positive half of the ACF is usually
given. The ACF plays a major role in modelling dependencies among
observations, since it characterises, along with the process mean $\mu = E(x_t)$
and variance $\sigma_x^2 = \gamma_0 = V(x_t)$, the stationary stochastic process describing
the evolution of $x_t$. It therefore indicates, by measuring the extent to which
one value of the process is correlated with previous values, the length and
strength of the ‘memory’ of the process.
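
The definitions of $\gamma_k$ and $\rho_k$ have obvious sample analogues, which the sketch below computes. It is an illustration under stated assumptions rather than a method prescribed in the text: the function names and the short data series are hypothetical, and the divisor $T$ (rather than $T - k$) is one common convention, chosen here because it keeps every sample autocorrelation within $[-1, 1]$.

```python
def sample_autocovariance(x, k):
    """Sample analogue of gamma_k = E[(x_t - mu)(x_{t-k} - mu)], with divisor T."""
    n = len(x)
    k = abs(k)  # gamma_k = gamma_{-k}, so only |k| matters
    mean = sum(x) / n
    return sum((x[t] - mean) * (x[t - k] - mean) for t in range(k, n)) / n

def sample_acf(x, max_lag):
    """Sample autocorrelations r_k = c_k / c_0 for k = 0, ..., max_lag."""
    c0 = sample_autocovariance(x, 0)
    return [sample_autocovariance(x, k) / c0 for k in range(max_lag + 1)]

series = [1.0, 3.0, 2.0, 5.0, 4.0, 6.0, 5.0, 8.0]  # hypothetical observations
acf = sample_acf(series, 3)
```

The first element of the returned list is always one, and negative lags give the same values as positive lags, matching the symmetry noted above.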


2.2 Stochastic difference equations 


A fundamental theorem in time series analysis, known as Wold’s decomposition
(Wold, 1938: see Hamilton, 1994, chap. 4.8), states that every weakly
stationary, purely non-deterministic stochastic process $(x_t - \mu)$ can be
written as a linear combination (or linear filter) of a sequence of uncorrelated
random variables. By ‘purely non-deterministic’ we mean that any linearly
deterministic components have been subtracted from $(x_t - \mu)$. Such a
component is one that can be perfectly predicted from past values of itself,
and examples commonly found are a (constant) mean, as is implied by
writing the process as $(x_t - \mu)$, periodic sequences, and polynomial or
exponential sequences in t. A formal discussion of this theorem, well beyond 
the scope of this book, may be found in, for example, Brockwell and Davis 
(1996, chap. 5.7), but Wold’s decomposition underlies all the theoretical 
models of time series that are subsequently to be introduced. 
This linear filter representation is given by 


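$$x_t - \mu = a_t + \psi_1 a_{t-1} + \psi_2 a_{t-2} + \cdots = \sum_{j=0}^{\infty} \psi_j a_{t-j}, \qquad \psi_0 = 1 \qquad (2.1)$$

The $\{a_t : t = 0, \pm 1, \pm 2, \ldots\}$ are a sequence of uncorrelated random
variables, often known as innovations, drawn from a fixed distribution with

$$E(a_t) = 0, \qquad V(a_t) = E(a_t^2) = \sigma^2 < \infty$$

and

$$\mathrm{Cov}(a_t, a_{t-k}) = E(a_t a_{t-k}) = 0, \quad \text{for all } k \neq 0$$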
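
We will refer to such a sequence as a white-noise process, often denoting it as
$a_t \sim WN(0, \sigma^2)$. The coefficients (possibly infinite in number) in the linear
filter are known as $\psi$-weights.

We can easily show that the model (2.1) leads to autocorrelation in $x_t$.
From this equation it follows that

$$E(x_t) = \mu$$

$$\begin{aligned}
\gamma_0 = V(x_t) &= E(x_t - \mu)^2 \\
&= E(a_t + \psi_1 a_{t-1} + \psi_2 a_{t-2} + \cdots)^2 \\
&= E(a_t^2) + \psi_1^2 E(a_{t-1}^2) + \psi_2^2 E(a_{t-2}^2) + \cdots \\
&= \sigma^2 + \psi_1^2 \sigma^2 + \psi_2^2 \sigma^2 + \cdots \\
&= \sigma^2 \sum_{j=0}^{\infty} \psi_j^2
\end{aligned}$$

by using the result that $E(a_{t-i} a_{t-j}) = 0$ for $i \neq j$. Now

$$\begin{aligned}
\gamma_k &= E(x_t - \mu)(x_{t-k} - \mu) \\
&= E(a_t + \psi_1 a_{t-1} + \cdots + \psi_k a_{t-k} + \cdots)(a_{t-k} + \psi_1 a_{t-k-1} + \cdots) \\
&= \sigma^2 (1 \cdot \psi_k + \psi_1 \psi_{k+1} + \psi_2 \psi_{k+2} + \cdots) \\
&= \sigma^2 \sum_{j=0}^{\infty} \psi_j \psi_{j+k}
\end{aligned}$$

and this implies

$$\rho_k = \frac{\sum_{j=0}^{\infty} \psi_j \psi_{j+k}}{\sum_{j=0}^{\infty} \psi_j^2}$$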


If the number of $\psi$-weights in (2.1) is infinite, we have to assume that the
weights are absolutely summable, i.e. that $\sum_{j=0}^{\infty} |\psi_j| < \infty$, in which case the
linear filter representation is said to converge. This condition can be shown to
be equivalent to assuming that $x_t$ is stationary, and guarantees that all
moments exist and are independent of time, in particular that the variance of
$x_t$, $\gamma_0$, is finite.
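
The expression for $\rho_k$ translates directly into a short computation. The
following is a minimal sketch in Python, assuming a finite (or truncated) list
of $\psi$-weights; the function name `acf_from_psi` and the numerical values are
illustrative and are not taken from the text.

```python
def acf_from_psi(psi, max_lag):
    """Autocorrelations rho_0..rho_max_lag implied by a list of psi-weights.

    psi[0] should be 1 (psi_0 = 1 in equation (2.1)). Weights beyond the end
    of the list are treated as zero, which is only an approximation when the
    true filter is infinite.
    """
    denom = sum(w * w for w in psi)  # gamma_0 / sigma^2
    rho = []
    for k in range(max_lag + 1):
        # gamma_k / sigma^2 = sum_j psi_j * psi_{j+k}
        num = sum(psi[j] * psi[j + k] for j in range(len(psi) - k))
        rho.append(num / denom)
    return rho


# Illustrative weights psi_j = 0.5**j, truncated after 60 terms
psi = [0.5 ** j for j in range(60)]
print(acf_from_psi(psi, 3))
```

For these geometrically declining weights the printed autocorrelations also
decline geometrically, and the truncation error is negligible because the
omitted weights are tiny.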

Wold’s theorem is fundamental, in that it allows us to represent any 
arbitrary linear process as a stochastic difference equation with infinite lags.
In practical terms this representation may not seem very useful, since it
requires the estimation of an infinite number of $\psi$-weights in (2.1). As will be
shown in subsequent sections, however, this infinite-order stochastic
difference equation can be represented exactly or approximately by a
parsimonious ratio of finite-order stochastic difference equation models.


2.3 ARMA processes


2.3.1 Autoregressive processes 


Although equation (2.1) may appear complicated, many realistic models
result from particular choices of the $\psi$-weights. Taking $\mu = 0$ without loss of
generality, choosing $\psi_j = \phi^j$ allows (2.1) to be written

$$\begin{aligned}
x_t &= a_t + \phi a_{t-1} + \phi^2 a_{t-2} + \cdots \\
&= a_t + \phi (a_{t-1} + \phi a_{t-2} + \cdots) \\
&= \phi x_{t-1} + a_t
\end{aligned}$$

or

$$x_t - \phi x_{t-1} = a_t \tag{2.2}$$


This is known as a first-order autoregressive process, often given the
acronym AR(1). The backshift (or lag) operator, $B$, is now introduced for
notational convenience. This shifts time one step back, so that

$$B x_t = x_{t-1}$$

and, in general,

$$B^m x_t = x_{t-m}$$

noting that $B^m \mu = \mu$. The lag operator allows (possibly infinite) distributed
lags to be written in a very concise way. For example, by using this notation
the AR(1) model can be written as

$$(1 - \phi B) x_t = a_t$$

so that

$$\begin{aligned}
x_t = (1 - \phi B)^{-1} a_t &= (1 + \phi B + \phi^2 B^2 + \cdots) a_t \\
&= a_t + \phi a_{t-1} + \phi^2 a_{t-2} + \cdots
\end{aligned} \tag{2.3}$$


This linear filter representation will converge as long as $|\phi| < 1$, which is
therefore the stationarity condition.

We can now deduce the ACF of an AR(1) process. Multiplying both sides
of (2.2) by $x_{t-k}$, $k > 0$, and taking expectations yields

$$\gamma_k - \phi \gamma_{k-1} = E(a_t x_{t-k}) \tag{2.4}$$

From (2.3), $a_t x_{t-k} = \sum_{i=0}^{\infty} \phi^i a_t a_{t-k-i}$. As $a_t$ is white noise, any term in
$a_t a_{t-k-i}$ has zero expectation if $k + i > 0$. Thus, (2.4) simplifies to

$$\gamma_k = \phi \gamma_{k-1}, \quad \text{for all } k > 0$$

and, consequently, $\gamma_k = \phi^k \gamma_0$. An AR(1) process therefore has an ACF given
by $\rho_k = \phi^k$. Thus, if $\phi > 0$, the ACF decays exponentially to zero, while, if
$\phi < 0$, the ACF decays in an oscillatory pattern, both decays being slow if $\phi$ is
close to the non-stationary boundaries of $+1$ and $-1$.

The ACFs for two AR(1) processes with (a) $\phi = 0.5$, and (b) $\phi = -0.5$, are
shown in figure 2.1, along with generated data from the processes with $a_t$
assumed to be normally and independently distributed with $\sigma^2 = 25$,
denoted $a_t \sim NID(0, 25)$, and with starting value $x_0 = 0$. With $\phi > 0$ (c),
adjacent values are positively correlated and the generated series has a
tendency to exhibit ‘low-frequency’ trends. With $\phi < 0$ (d), however, adjacent
values have a negative correlation and the generated series displays violent,
rapid oscillations.
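
The behaviour just described can be reproduced by simulation. The sketch
below is a hedged illustration using only the Python standard library: it
generates AR(1) data with $a_t \sim NID(0, 25)$ and $x_0 = 0$, as in the text, and
compares the sample autocorrelations with $\phi^k$. The function names, the seed
and the sample size of 20,000 are assumptions made for the example; they are
not the settings used to produce figure 2.1.

```python
import random


def simulate_ar1(phi, n, sigma=5.0, x0=0.0, seed=0):
    """Generate x_t = phi * x_{t-1} + a_t with a_t ~ NID(0, sigma^2)."""
    rng = random.Random(seed)
    x, prev = [], x0
    for _ in range(n):
        prev = phi * prev + rng.gauss(0.0, sigma)
        x.append(prev)
    return x


def sample_acf(x, max_lag):
    """Sample autocorrelations at lags 1..max_lag about the sample mean."""
    n = len(x)
    mean = sum(x) / n
    c0 = sum((v - mean) ** 2 for v in x)
    return [
        sum((x[t] - mean) * (x[t - k] - mean) for t in range(k, n)) / c0
        for k in range(1, max_lag + 1)
    ]


for phi in (0.5, -0.5):
    r = sample_acf(simulate_ar1(phi, 20000), 3)
    # Sample values next to the theoretical phi**k for k = 1, 2, 3
    print(phi, [round(v, 2) for v in r], [round(phi ** k, 2) for k in (1, 2, 3)])
```

With a long realisation the sample values should lie close to $\phi^k$, alternating
in sign when $\phi$ is negative.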


Figure 2.1 ACFs and simulations of AR(1) processes


2.3.2 Moving average processes


Now consider the model obtained by choosing $\psi_1 = -\theta$ and $\psi_j = 0$, $j \geq 2$, in (2.1):

$$x_t = a_t - \theta a_{t-1}$$

or

$$x_t = (1 - \theta B) a_t \tag{2.5}$$


This is known as the first-order moving average, or MA(1), process and it follows
immediately that

$$\gamma_0 = \sigma^2 (1 + \theta^2), \qquad \gamma_1 = -\sigma^2 \theta, \qquad \gamma_k = 0 \ \text{ for } k > 1$$

and hence its ACF is described by

$$\rho_1 = \frac{-\theta}{1 + \theta^2}, \qquad \rho_k = 0, \quad k > 1$$

Thus, although observations one period apart are correlated, observations more
than one period apart are not, so that the ‘memory’ of the process is just one
period: this ‘jump’ to zero autocorrelation at $k = 2$ may be contrasted with the
smooth, exponential decay of the ACF of an AR(1) process.

The expression for $\rho_1$ can be written as the quadratic equation
$\theta^2 \rho_1 + \theta + \rho_1 = 0$. Since $\theta$ must be real, it follows that $-\tfrac{1}{2} < \rho_1 < \tfrac{1}{2}$. Both $\theta$ and $1/\theta$ will
satisfy this equation, however, and thus two MA(1) processes can always be
found that correspond to the same ACF. Since any moving average model
consists of a finite number of $\psi$-weights, all MA models are stationary. In order
to obtain a converging autoregressive representation, however, the restriction
$|\theta| < 1$ must be imposed. This restriction is known as the invertibility condition,
and implies that the process can be written in terms of an infinite autoregressive
representation

$$x_t = \pi_1 x_{t-1} + \pi_2 x_{t-2} + \cdots + a_t$$

where the $\pi$-weights converge, i.e. $\sum_{j=1}^{\infty} |\pi_j| < \infty$. In fact, the MA(1) model
can be written as

$$(1 - \theta B)^{-1} x_t = a_t$$

and expanding $(1 - \theta B)^{-1}$ yields

$$(1 + \theta B + \theta^2 B^2 + \cdots) x_t = a_t$$

The weights $\pi_j = -\theta^j$ will converge if $|\theta| < 1$, i.e. if the model is invertible. This
implies the reasonable assumption that the effect of past observations decreases
with age.
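
The fact that $\theta$ and $1/\theta$ share the same $\rho_1$ can be checked numerically. The
sketch below, offered as an illustration rather than a prescribed procedure,
solves the quadratic $\theta^2 \rho_1 + \theta + \rho_1 = 0$ and returns the invertible root
first. The function names and the value $\theta = 0.8$ are assumptions chosen for
the example.

```python
import math


def ma1_rho1(theta):
    """Lag-1 autocorrelation of x_t = a_t - theta * a_{t-1}."""
    return -theta / (1.0 + theta ** 2)


def ma1_thetas_from_rho1(rho1):
    """Both real roots of theta^2*rho1 + theta + rho1 = 0, smaller modulus first."""
    if rho1 == 0.0:
        return (0.0, math.inf)  # theta = 0; its reciprocal is unbounded
    disc = 1.0 - 4.0 * rho1 ** 2
    if disc < 0.0:
        raise ValueError("no real theta: |rho1| must not exceed 0.5")
    root = math.sqrt(disc)
    a = (-1.0 + root) / (2.0 * rho1)
    b = (-1.0 - root) / (2.0 * rho1)
    return (a, b) if abs(a) <= abs(b) else (b, a)


rho1 = ma1_rho1(0.8)
invertible, non_invertible = ma1_thetas_from_rho1(rho1)
print(round(rho1, 4), round(invertible, 4), round(non_invertible, 4))
```

The two roots multiply to one, so whenever they differ exactly one of them
satisfies the invertibility condition $|\theta| < 1$.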

Figure 2.2 presents plots of generated data from two MA(1) processes with
(a) $\theta = 0.8$ and (b) $\theta = -0.8$, in each case with $a_t \sim NID(0, 25)$. On
comparison of these plots with those of the AR(1) processes in figure 2.1, it is seen
that realisations from the two types of processes are often quite similar,
suggesting that it may, on occasions, be difficult to distinguish between the two.


2.3.3 General AR and MA processes 


Figure 2.2 Simulations of MA(1) processes

Extensions to the AR(1) and MA(1) models are immediate. The general
autoregressive model of order $p$, AR($p$), can be written as

$$x_t - \phi_1 x_{t-1} - \phi_2 x_{t-2} - \cdots - \phi_p x_{t-p} = a_t$$

or

$$(1 - \phi_1 B - \phi_2 B^2 - \cdots - \phi_p B^p) x_t = \phi(B) x_t = a_t$$

The linear filter representation $x_t = \psi(B) a_t$ can be obtained by equating
coefficients in $\phi(B)\psi(B) = 1$ (see Mills, 1990, chap. 5, for examples of how
to do this). The stationarity conditions required for convergence of the
$\psi$-weights are that the roots of the characteristic equation

$$\phi(B) = (1 - g_1 B)(1 - g_2 B) \cdots (1 - g_p B) = 0$$

are such that $|g_i| < 1$ for $i = 1, 2, \ldots, p$, an equivalent phrase being that
the roots $g_i^{-1}$ all lie outside the unit circle. The behaviour of the ACF is
determined by the difference equation

$$\phi(B) \rho_k = 0, \quad k > 0 \tag{2.6}$$

which has the solution

$$\rho_k = A_1 g_1^k + A_2 g_2^k + \cdots + A_p g_p^k$$


Since $|g_i| < 1$, the ACF is thus described by a mixture of damped exponentials
(for real roots) and damped sine waves (for complex roots). As an example,
consider the AR(2) process

$$(1 - \phi_1 B - \phi_2 B^2) x_t = a_t$$

with characteristic equation

$$\phi(B) = (1 - g_1 B)(1 - g_2 B) = 0$$

The roots $g_1$ and $g_2$ are given by

$$g_1, g_2 = \tfrac{1}{2} \left( \phi_1 \pm (\phi_1^2 + 4\phi_2)^{1/2} \right)$$

and can both be real, or they can be a pair of complex numbers. For
stationarity, it is required that the roots be such that $|g_1| < 1$ and $|g_2| < 1$, and
it can be shown that these conditions imply the following set of restrictions
on $\phi_1$ and $\phi_2$:

$$\phi_1 + \phi_2 < 1, \qquad -\phi_1 + \phi_2 < 1, \qquad -1 < \phi_2 < 1$$

The roots will be complex if $\phi_1^2 + 4\phi_2 < 0$, although a necessary condition for
complex roots is simply that $\phi_2 < 0$.
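
The root formula and the three restrictions can be compared directly. The
following sketch assumes nothing beyond the expressions above; the function
names and the three coefficient pairs are illustrative choices and are not the
cases plotted in figure 2.3.

```python
import cmath


def ar2_roots(phi1, phi2):
    """Roots g1, g2 of g^2 - phi1*g - phi2 = 0, so phi(B) = (1 - g1 B)(1 - g2 B)."""
    disc = cmath.sqrt(phi1 ** 2 + 4.0 * phi2)
    return (phi1 + disc) / 2.0, (phi1 - disc) / 2.0


def ar2_is_stationary(phi1, phi2):
    """The three restrictions on (phi1, phi2) quoted in the text."""
    return phi1 + phi2 < 1 and -phi1 + phi2 < 1 and -1 < phi2 < 1


# Illustrative pairs: real roots, complex roots, and a non-stationary case
for phi1, phi2 in [(0.5, 0.3), (1.0, -0.5), (0.5, 0.6)]:
    g1, g2 = ar2_roots(phi1, phi2)
    is_complex = phi1 ** 2 + 4 * phi2 < 0
    print(phi1, phi2, ar2_is_stationary(phi1, phi2),
          round(abs(g1), 3), round(abs(g2), 3), is_complex)
```

In each case the verdict from the restrictions agrees with whether both root
moduli are below one.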

The behaviour of the ACF of an AR(2) process for four combinations of
$(\phi_1, \phi_2)$ is shown in figure 2.3. If $g_1$ and $g_2$ are real (cases (a) and (b)), the
ACF is a mixture of two damped exponentials. Depending on their sign, the
autocorrelations can also damp out in an oscillatory manner. If the roots are
complex (cases (c) and (d)), the ACF follows a damped sine wave. Figure 2.4
shows plots of generated time series from these four AR(2) processes, in each
case with $a_t \sim NID(0, 25)$. Depending on the signs of the real roots, the series
may be either smooth or jagged, while complex roots tend to induce
‘pseudo-periodic’ behaviour.

Figure 2.3 ACFs of various AR(2) processes

Since all AR processes have ACFs that ‘damp out’, it is sometimes difficult
to distinguish between processes of different orders. To aid with such
discrimination, we may use the partial autocorrelation function (PACF). In
general, the correlation between two random variables is often due to both
variables being correlated with a third. In the present context, a large portion
of the correlation between $x_t$ and $x_{t-k}$ may be due to the correlation this pair
have with the intervening lags $x_{t-1}, x_{t-2}, \ldots, x_{t-k+1}$. To adjust for this
correlation, the partial autocorrelations may be calculated.

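The $k$th partial autocorrelation is the coefficient $\phi_{kk}$ in the AR($k$) process

$$x_t = \phi_{k1} x_{t-1} + \phi_{k2} x_{t-2} + \cdots + \phi_{kk} x_{t-k} + a_t \tag{2.7}$$

and measures the additional correlation between $x_t$ and $x_{t-k}$ after
adjustments have been made for the intervening lags.

In general, $\phi_{kk}$ can be obtained from the Yule-Walker equations that
correspond to (2.7). These are given by the set of equations (2.6) with $p = k$ and
$\phi_i = \phi_{ki}$, and solving for the last coefficient $\phi_{kk}$ using Cramer’s rule leads to

$$\phi_{kk} = \frac{\begin{vmatrix}
1 & \rho_1 & \cdots & \rho_{k-2} & \rho_1 \\
\rho_1 & 1 & \cdots & \rho_{k-3} & \rho_2 \\
\vdots & \vdots & & \vdots & \vdots \\
\rho_{k-1} & \rho_{k-2} & \cdots & \rho_1 & \rho_k
\end{vmatrix}}{\begin{vmatrix}
1 & \rho_1 & \cdots & \rho_{k-2} & \rho_{k-1} \\
\rho_1 & 1 & \cdots & \rho_{k-3} & \rho_{k-2} \\
\vdots & \vdots & & \vdots & \vdots \\
\rho_{k-1} & \rho_{k-2} & \cdots & \rho_1 & 1
\end{vmatrix}}$$

It follows from the definition of $\phi_{kk}$ that the PACFs of AR processes are of a
particular form:

$$\begin{aligned}
\text{AR(1):} \quad & \phi_{11} = \rho_1 = \phi, \quad \phi_{kk} = 0 \ \text{ for } k > 1 \\
\text{AR(2):} \quad & \phi_{11} = \rho_1, \quad \phi_{22} = \frac{\rho_2 - \rho_1^2}{1 - \rho_1^2}, \quad \phi_{kk} = 0 \ \text{ for } k > 2 \\
\text{AR}(p)\text{:} \quad & \phi_{11} \neq 0, \ \phi_{22} \neq 0, \ \ldots, \ \phi_{pp} \neq 0, \quad \phi_{kk} = 0 \ \text{ for } k > p
\end{aligned}$$

Thus, the partial autocorrelations for lags larger than the order of the process
are zero. Hence, an AR($p$) process is described by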

(i) an ACF that is infinite in extent and is a combination of damped 
exponentials and damped sine waves, and 

(ii) a PACF that is zero for lags larger than p. 
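
The cut-off property in (ii) can be verified numerically. The sketch below
does not evaluate the determinants; as a swapped-in alternative it uses the
Durbin-Levinson recursion, which solves the same Yule-Walker equations order
by order. The function name and the example ACF $\rho_k = 0.5^k$ are assumptions
made for illustration.

```python
def pacf_from_acf(rho, max_lag):
    """Partial autocorrelations phi_11..phi_kk from rho = [rho_0, rho_1, ...].

    Durbin-Levinson recursion: each step yields the last coefficient of the
    AR(k) Yule-Walker system from the coefficients of the AR(k-1) system.
    """
    pacf, prev = [], []  # prev holds phi_{k-1,1}, ..., phi_{k-1,k-1}
    for k in range(1, max_lag + 1):
        num = rho[k] - sum(prev[j] * rho[k - 1 - j] for j in range(k - 1))
        den = 1.0 - sum(prev[j] * rho[j + 1] for j in range(k - 1))
        phi_kk = num / den
        # Update the lower-order coefficients, then append phi_kk itself
        prev = [prev[j] - phi_kk * prev[k - 2 - j] for j in range(k - 1)] + [phi_kk]
        pacf.append(phi_kk)
    return pacf


# ACF of an AR(1) process with phi = 0.5 (illustrative value)
rho = [0.5 ** k for k in range(6)]
print(pacf_from_acf(rho, 4))
```

For this AR(1) autocorrelation function only the first partial autocorrelation
is non-zero, as the AR(1) line of the display above requires.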

The general moving average of order $q$, MA($q$), can be written as

$$x_t = a_t - \theta_1 a_{t-1} - \cdots - \theta_q a_{t-q}$$

or

$$x_t = (1 - \theta_1 B - \cdots - \theta_q B^q) a_t = \theta(B) a_t$$

The ACF can be shown to be

$$\rho_k = \frac{-\theta_k + \theta_1 \theta_{k+1} + \cdots + \theta_{q-k} \theta_q}{1 + \theta_1^2 + \cdots + \theta_q^2}, \quad k = 1, 2, \ldots, q, \qquad \rho_k = 0, \quad k > q$$

The ACF of an MA($q$) process therefore cuts off after lag $q$; the memory of
the process extends $q$ periods, observations more than $q$ periods apart being
uncorrelated.

The weights in the AR($\infty$) representation $\pi(B) x_t = a_t$ are given by
$\pi(B) = \theta^{-1}(B)$ and can be obtained by equating coefficients of $B^j$ in $\pi(B)\theta(B) = 1$.
For invertibility, the roots of

$$(1 - \theta_1 B - \cdots - \theta_q B^q) = (1 - h_1 B) \cdots (1 - h_q B) = 0$$

must satisfy $|h_i| < 1$ for $i = 1, 2, \ldots, q$.
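
Equating coefficients of $B^j$ amounts to a simple recursion. The following is a
minimal sketch, assuming polynomials are stored as lists of coefficients in
ascending powers of $B$; the helper name `ratio_weights` and the value
$\theta = 0.8$ are hypothetical. The same routine gives $\psi$-weights when it is
applied to $\theta(B)/\phi(B)$ instead.

```python
def ratio_weights(num, den, n_terms):
    """First n_terms coefficients c_0, c_1, ... of the series num(B) / den(B).

    Both arguments list coefficients in ascending powers of B, with
    den[0] != 0. The c_j follow from equating powers of B in
    c(B) * den(B) = num(B).
    """
    c = []
    for j in range(n_terms):
        nj = num[j] if j < len(num) else 0.0
        acc = sum(den[i] * c[j - i] for i in range(1, min(j, len(den) - 1) + 1))
        c.append((nj - acc) / den[0])
    return c


theta = 0.8  # illustrative MA(1) parameter
# pi(B) = 1 / (1 - theta*B), the inverse of the MA(1) polynomial
print(ratio_weights([1.0], [1.0, -theta], 5))
```

For the MA(1) case the coefficients are the powers of $\theta$, matching the
expansion of $(1 - \theta B)^{-1}$ given in section 2.3.2.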

Figure 2.5 presents generated series from two MA(2) processes, again
using $a_t \sim NID(0, 25)$. The series tend to be fairly jagged, similar to AR(2)
processes with real roots of opposite signs, and, of course, such MA processes
are unable to capture periodic-type behaviour.

Figure 2.5 Simulations of MA(2) processes

The PACF of an MA($q$) process can be shown to be infinite in extent (i.e. it
tails off). Explicit expressions for the PACFs of MA processes are
complicated but, in general, are dominated by combinations of exponential decays
(for the real roots in $\theta(B)$) and/or damped sine waves (for the complex
roots). Their patterns are thus very similar to the ACFs of AR processes.
Indeed, an important duality between AR and MA processes exists: while the
ACF of an AR($p$) process is infinite in extent, the PACF cuts off after lag $p$.
The ACF of an MA($q$) process, on the other hand, cuts off after lag $q$, while
the PACF is infinite in extent.


2.3.4 Autoregressive moving average models


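We may also consider combinations of autoregressive and moving average
models. For example, consider the natural combination of the AR(1) and
MA(1) models, known as the first-order autoregressive moving average, or
ARMA(1,1), model:

$$x_t - \phi x_{t-1} = a_t - \theta a_{t-1}$$

or

$$(1 - \phi B) x_t = (1 - \theta B) a_t \tag{2.8}$$

The $\psi$-weights in the MA($\infty$) representation are given by

$$\psi(B) = \frac{(1 - \theta B)}{(1 - \phi B)}$$

i.e.

$$x_t = \psi(B) a_t = \left( \sum_{i=0}^{\infty} \phi^i B^i \right) (1 - \theta B) a_t = a_t + (\phi - \theta) \sum_{i=1}^{\infty} \phi^{i-1} a_{t-i} \tag{2.9}$$

Likewise, the $\pi$-weights in the AR($\infty$) representation are given by

$$\pi(B) = \frac{(1 - \phi B)}{(1 - \theta B)}$$

i.e.

$$\pi(B) x_t = \left( \sum_{i=0}^{\infty} \theta^i B^i \right) (1 - \phi B) x_t = a_t$$

or

$$x_t = (\phi - \theta) \sum_{i=1}^{\infty} \theta^{i-1} x_{t-i} + a_t$$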
The ARMA(1,1) model thus leads to both moving average and
autoregressive representations having an infinite number of weights. The
$\psi$-weights converge for $|\phi| < 1$ (the stationarity condition) and the $\pi$-weights
converge for $|\theta| < 1$ (the invertibility condition). The stationarity condition
for the ARMA(1,1) model is thus the same as that of an AR(1) model.

From equation (2.9) it is clear that any product $x_{t-k} a_{t-j}$ has zero
expectation if $k > j$. Thus, multiplying both sides of (2.8) by $x_{t-k}$ and taking
expectations yields

$$\gamma_k = \phi \gamma_{k-1}, \quad \text{for } k > 1$$

whilst for $k = 0$ and $k = 1$ we obtain, respectively,

$$\gamma_0 - \phi \gamma_1 = \sigma^2 - \theta (\phi - \theta) \sigma^2$$

and

$$\gamma_1 - \phi \gamma_0 = -\theta \sigma^2$$

Eliminating $\sigma^2$ from these two equations allows the ACF of the ARMA(1,1)
process to be given by

$$\rho_1 = \frac{(1 - \phi\theta)(\phi - \theta)}{1 + \theta^2 - 2\phi\theta}$$

and

$$\rho_k = \phi \rho_{k-1}, \quad \text{for } k > 1$$

The ACF of an ARMA(1,1) process is therefore similar to that of an AR(1)
process, in that the autocorrelations decay exponentially at a rate $\phi$. Unlike
the AR(1), however, this decay starts from $\rho_1$ rather than from $\rho_0 = 1$.
Moreover, $\rho_1 \neq \phi$ and, since for typical financial series both $\phi$ and $\theta$ will be
positive with $\phi > \theta$, $\rho_1$ can be much less than $\phi$ if $\phi - \theta$ is small.
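
A small numerical illustration makes the last point concrete. The sketch below
evaluates the ARMA(1,1) autocorrelations from the two expressions above; the
function names and the parameter values $\phi = 0.9$, $\theta = 0.8$ are assumptions
chosen only to show a case in which $\phi - \theta$ is small.

```python
def arma11_rho1(phi, theta):
    """Lag-1 autocorrelation of (1 - phi*B) x_t = (1 - theta*B) a_t."""
    return (1 - phi * theta) * (phi - theta) / (1 + theta ** 2 - 2 * phi * theta)


def arma11_acf(phi, theta, max_lag):
    """rho_0..rho_max_lag, using rho_k = phi * rho_{k-1} for k > 1."""
    rho = [1.0]
    if max_lag >= 1:
        rho.append(arma11_rho1(phi, theta))
    for _ in range(2, max_lag + 1):
        rho.append(phi * rho[-1])
    return rho


# Illustrative values with phi > theta > 0 and phi - theta small
print([round(v, 4) for v in arma11_acf(0.9, 0.8, 3)])
```

Here $\rho_1$ is about 0.14, far below $\phi = 0.9$, after which the autocorrelations
decay at rate $\phi$. Setting $\theta = 0$ or $\phi = 0$ in `arma11_rho1` recovers the
AR(1) and MA(1) results, respectively.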

More general ARMA processes are obtained by combining AR($p$) and
MA($q$) processes:

$$x_t - \phi_1 x_{t-1} - \cdots - \phi_p x_{t-p} = a_t - \theta_1 a_{t-1} - \cdots - \theta_q a_{t-q}$$

or

$$(1 - \phi_1 B - \cdots - \phi_p B^p) x_t = (1 - \theta_1 B - \cdots - \theta_q B^q) a_t \tag{2.10}$$

i.e.

$$\phi(B) x_t = \theta(B) a_t$$

The resultant ARMA($p$,$q$) process has the stationarity and invertibility
conditions associated with the constituent AR($p$) and MA($q$) processes,
respectively. Its ACF will eventually follow the same pattern as that of an
AR($p$) process after $q - p + 1$ initial values $\rho_0, \rho_1, \ldots, \rho_{q-p}$ (if $q - p < 0$ there
will be no initial values), while its PACF eventually (for $k > p - q$) behaves
like that of an MA($q$) process.

Throughout this development, we have assumed that the mean of the
process, $\mu$, is zero. Non-zero means are easily accommodated by replacing $x_t$ with
$x_t - \mu$ in (2.10), so that in the general case of an ARMA($p$,$q$) process we have

$$\phi(B)(x_t - \mu) = \theta(B) a_t$$

Noting that $\phi(B)\mu = (1 - \phi_1 - \cdots - \phi_p)\mu = \phi(1)\mu$, the model can
equivalently be written as

$$\phi(B) x_t = \theta_0 + \theta(B) a_t$$

where $\theta_0 = \phi(1)\mu$ is a constant or intercept.


2.4 Linear stochastic processes 


In this development of ARMA models, we have assumed that the
innovations $\{a_t\}$ are uncorrelated and drawn from a fixed distribution with
finite variance, and hence the sequence has been termed white noise,
i.e. $a_t \sim WN(0, \sigma^2)$. If these innovations are also independent (in which case
we denote them as being iid), then the sequence is termed strict white noise,
denoted $a_t \sim SWN(0, \sigma^2)$. A stationary process $\{x_t\}$ generated as a linear filter
of strict white noise is said to be a linear process. It is possible, however, for a
linear filter of a white noise process to result in a non-linear stationary
process. The distinctions between white and strict white noise and between
linear and non-linear stationary processes are extremely important when
modelling financial time series, and, as was alluded to in section 2.1.2, will be
discussed in more detail in chapter 5.


2.5 ARMA model building 


2.5.1 Sample autocorrelation and partial autocorrelation functions 


An essential first step in fitting ARMA models to observed time series is to
obtain estimates of the generally unknown parameters, $\mu$, $\sigma_x^2$ and the $\rho_k$.
With our stationarity and (implicit) ergodicity assumptions, $\mu$ and $\sigma_x^2$ can be
estimated by the sample mean and sample variance, respectively, of the
realisation $\{x_t\}_1^T$:

$$\bar{x} = T^{-1} \sum_{t=1}^{T} x_t$$

$$s^2 = T^{-1} \sum_{t=1}^{T} (x_t - \bar{x})^2$$

An estimate of $\rho_k$ is then given by the lag $k$ sample autocorrelation

$$r_k = \frac{\sum_{t=k+1}^{T} (x_t - \bar{x})(x_{t-k} - \bar{x})}{T s^2}, \qquad k = 1, 2, \ldots$$

the set of $r_k$s defining the sample autocorrelation function (SACF).

For independent observations drawn from a fixed distribution with finite
variance ($\rho_k = 0$, for all $k \neq 0$), the variance of $r_k$ is approximately given by
$T^{-1}$ (see, for example, Box and Jenkins, 1976, chap. 2). If, moreover, $T$ is
large, $\sqrt{T} r_k$ will be approximately standard normal, i.e. $\sqrt{T} r_k \overset{a}{\sim} N(0, 1)$, so
that an absolute value of $r_k$ in excess of $2T^{-1/2}$ may be regarded as
‘significantly’ different from zero. More generally, if $\rho_k = 0$ for $k > q$, the variance of
$r_k$, for $k > q$, is

$$V(r_k) = T^{-1} (1 + 2\rho_1^2 + \cdots + 2\rho_q^2)$$

Thus, by successively increasing the value of $q$ and replacing the $\rho_j$s by their
sample estimates, the variances of the sequence $r_1, r_2, \ldots, r_k$ can be estimated
as $T^{-1}$, $T^{-1}(1 + 2r_1^2)$, $\ldots$, $T^{-1}(1 + 2r_1^2 + \cdots + 2r_{k-1}^2)$ and, of course, these
will be larger, for $k > 1$, than those calculated using the simple formula $T^{-1}$.

The sample partial autocorrelation function (SPACF) is usually calculated
by fitting autoregressive models of increasing order: the estimate of the last
coefficient in each model is the sample partial autocorrelation, $\hat{\phi}_{kk}$. If the
data follow an AR($p$) process, then for lags greater than $p$ the variance of $\hat{\phi}_{kk}$
is approximately $T^{-1}$, so that $\sqrt{T} \hat{\phi}_{kk} \overset{a}{\sim} N(0, 1)$.
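
The successive variance estimates for the $r_k$ are easily computed. The sketch
below is one possible implementation, assuming the sample autocorrelations are
supplied as a list starting at lag 1; the function name `bartlett_se`, the
sample size and the input values are illustrative.

```python
import math


def bartlett_se(r, T):
    """Standard errors of r_1..r_k, where r = [r_1, ..., r_k].

    The k-th standard error is the square root of
    T^{-1} * (1 + 2*r_1^2 + ... + 2*r_{k-1}^2).
    """
    se, cum = [], 0.0
    for rk in r:
        se.append(math.sqrt((1.0 + 2.0 * cum) / T))
        cum += rk * rk  # include r_k in the sum used at the next lag
    return se


# Illustrative sample autocorrelations from a series of length T = 100
print([round(v, 4) for v in bartlett_se([0.5, 0.25, 0.125], 100)])
```

The first standard error is always $T^{-1/2}$; later ones grow with the size of
the preceding autocorrelations.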


2.5.2 Model-building procedures 


Given the $r_k$ and $\hat{\phi}_{kk}$ with their respective standard errors, the approach to
ARMA model building proposed by Box and Jenkins (1976) is essentially to
match the behaviour of the SACF and SPACF of a particular time series with
that of various theoretical ACFs and PACFs, picking the best match (or set of
matches), estimating the unknown model parameters (the $\phi_i$s, $\theta_i$s and $\sigma^2$) and
checking the residuals from the fitted models for any possible misspecifications.

Table 2.1 ACF of real S&P 500 returns and accompanying statistics

 k      r_k     s.e.(r_k)   Q(k)
 1     0.089    0.086        1.08 [0.30]
 2    -0.146    0.086        4.03 [0.13]
 3     0.063    0.088        4.59 [0.20]
 4    -0.074    0.089        5.37 [0.25]
 5    -0.121    0.089        7.45 [0.19]
 6     0.043    0.090        7.72 [0.26]
 7     0.122    0.090        9.86 [0.20]
 8    -0.064    0.091       10.46 [0.23]
 9    -0.022    0.092       10.53 [0.31]
10     0.067    0.092       11.19 [0.34]
11    -0.015    0.093       11.23 [0.42]
12    -0.112    0.093       13.10 [0.36]

Note: Figures in [..] give $P(\chi^2_k > Q(k))$.

Another popular method is to select a set of models based on prior 
considerations of maximum possible settings of p and q, estimate each 
possible model and select that model which minimises a chosen selection 
criterion based on goodness of fit considerations. Details of these model- 
building procedures, and their various modifications, may be found in many 
texts, such as Mills (1990, chap. 8), and hence will not be discussed in detail; 
rather, they will be illustrated by way of a sequence of examples. 


Example 2.1 Are the returns on the S&P 500 a fair game? 

An important and often analysed financial series is the real return on the
annual Standard and Poor’s (S&P) 500 stock index for the United States.
Annual observations from 1872 to 2006 are plotted in figure 2.6 and its SACF
up to $k = 12$ is given in table 2.1. It is seen that the series appears to be
stationary around a constant mean, estimated to be 3.59 per cent. This is
confirmed by the SACF, and a comparison of each of the $r_k$ with their
corresponding standard errors, computed using the expression for $V(r_k)$
given in section 2.5.1, shows that none is individually significantly different
from zero, thus suggesting that the series is, in fact, white noise.

Figure 2.6 Real S&P returns (annual 1872-2006)

We can construct a ‘portmanteau’ statistic based on the complete set of
$r_k$s. On the hypothesis that $x_t \sim SWN(\mu, \sigma^2)$, Box and Pierce (1970) show
that the statistic

$$Q^*(k) = T \sum_{i=1}^{k} r_i^2$$

is asymptotically distributed as $\chi^2$ with $k$ degrees of freedom, i.e. $Q^*(k) \overset{a}{\sim} \chi^2_k$.
Unfortunately, simulations have shown that, even for quite large samples, the
true significance levels of $Q^*(k)$ could be much smaller than those given by this
asymptotic theory, so that the probability of incorrectly rejecting the null
hypothesis will be smaller than any chosen significance level. Ljung and Box (1978)
argue that a better approximation is obtained when the modified statistic

$$Q(k) = T(T + 2) \sum_{i=1}^{k} (T - i)^{-1} r_i^2 \overset{a}{\sim} \chi^2_k$$

is used. $Q(k)$ statistics, with accompanying marginal significance levels of
rejecting the null, are also reported in table 2.1 for $k = 1, \ldots, 12$, and they
confirm that there is no evidence against the null hypothesis that returns are
white noise. Real returns on the S&P 500 would therefore appear to be
consistent with the fair game model in which the expected return is constant,
being 3.59 per cent per annum.
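
Both portmanteau statistics can be computed from a list of sample
autocorrelations. The sketch below is illustrative only: the function names
are hypothetical, the three autocorrelations are the rounded values from
table 2.1, and $T = 135$ is an assumption (one observation per year from 1872
to 2006) rather than a figure stated in the text.

```python
def box_pierce(r, T):
    """Q*(k) = T * sum of r_i^2 over i = 1..k, for r = [r_1, ..., r_k]."""
    return T * sum(ri * ri for ri in r)


def ljung_box(r, T):
    """Q(k) = T*(T + 2) * sum of r_i^2 / (T - i) over i = 1..k."""
    return T * (T + 2) * sum(ri * ri / (T - i) for i, ri in enumerate(r, start=1))


# First three rounded autocorrelations from table 2.1; T = 135 is assumed
r = [0.089, -0.146, 0.063]
T = 135
for k in range(1, len(r) + 1):
    print(k, round(box_pierce(r[:k], T), 2), round(ljung_box(r[:k], T), 2))
```

Because the inputs are rounded and $T$ is assumed, the resulting $Q(k)$ values
are close to, but not identical with, those in table 2.1. The modified
statistic always exceeds $Q^*(k)$, since each term is scaled by
$(T + 2)/(T - i) > 1$.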


Example 2.2 Modelling the UK interest rate spread 


As we shall see in chapter 10, the ‘spread’, the difference between long-term
and short-term interest rates, is an important variable in testing the
expectations hypothesis of the term structure of interest rates. Figure 2.7
shows the spread between twenty-year UK gilts and ninety-one-day Treasury
bills using monthly observations for the period 1952 to 2005 ($T = 648$), while
table 2.2 reports the SACF and SPACF up to $k = 12$, with accompanying
standard errors. (The spread may be derived from the interest rate series R20
and RS given in the data appendix.)

Table 2.2 SACF and SPACF of the UK spread

 k     r_k     s.e.(r_k)   phi_kk    s.e.(phi_kk)
 1    0.969    0.039        0.969    0.039
 2    0.927    0.066       -0.200    0.039
 3    0.886    0.084        0.020    0.039
 4    0.847    0.097        0.007    0.039
 5    0.806    0.108       -0.059    0.039
 6    0.763    0.116       -0.052    0.039
 7    0.721    0.124       -0.003    0.039
 8    0.680    0.130       -0.012    0.039
 9    0.643    0.135        0.049    0.039
10    0.612    0.140        0.045    0.039
11    0.584    0.144        0.006    0.039
12    0.557    0.147       -0.013    0.039

Note: phi_kk denotes the sample partial autocorrelation $\hat{\phi}_{kk}$.

Figure 2.7 UK interest rate spread (monthly March 1952 to December 2005)

The spread is seen to be considerably smoother than one would expect if it
was a realisation from a white noise process, and this is confirmed by the
SACF, all of whose values are positive and significant (the accompanying
portmanteau statistic is $Q(12) = 4557$!). The SPACF has both $\hat{\phi}_{11}$ and $\hat{\phi}_{22}$
significant, thus identifying an AR(2) process. Fitting such a model to the
series by ordinary least squares (OLS) regression yields

$$x_t = \underset{(0.019)}{0.032} + \underset{(0.039)}{1.180}\, x_{t-1} - \underset{(0.038)}{0.216}\, x_{t-2} + \hat{a}_t, \qquad \hat{\sigma} = 0.419$$

Figures in parentheses are standard errors and the intercept implies a fitted
mean of $\hat{\mu} = \hat{\theta}_0 / (1 - \hat{\phi}_1 - \hat{\phi}_2) = 0.906$, with standard error 0.469. Since
$\hat{\phi}_1 + \hat{\phi}_2 = 0.965$, $-\hat{\phi}_1 + \hat{\phi}_2 = -1.396$ and $\hat{\phi}_2 = -0.216$, the stationarity
conditions associated with an AR(2) process are satisfied, but, although $\hat{\phi}_2$ is
negative, $\hat{\phi}_1^2 + 4\hat{\phi}_2 = 0.528$, so that the roots are real, being $\hat{g}_1 = 0.95$
and $\hat{g}_2 = 0.23$. The closeness of $\hat{g}_1$ to unity will be discussed further later in
the chapter.
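
As a check on the arithmetic, the quoted roots can be reproduced from the
rounded estimates $\hat{\phi}_1 = 1.180$ and $\hat{\phi}_2 = -0.216$ using the AR(2) root
formula of section 2.3.3. The intermediate figures below are computed from
these rounded values and so may differ slightly from results based on the
unrounded estimates.

$$\begin{aligned}
\hat{\phi}_1^2 + 4\hat{\phi}_2 &= 1.180^2 + 4(-0.216) = 1.3924 - 0.8640 = 0.5284 \\
(\hat{\phi}_1^2 + 4\hat{\phi}_2)^{1/2} &\approx 0.727 \\
\hat{g}_1 &= \tfrac{1}{2}(1.180 + 0.727) \approx 0.95 \\
\hat{g}_2 &= \tfrac{1}{2}(1.180 - 0.727) \approx 0.23
\end{aligned}$$

Since the discriminant is positive, both roots are real, and both lie inside
the unit interval, in agreement with the stationarity conditions.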

Having fitted an AR(2) process, it is now necessary to check whether such
a model is adequate. As a ‘diagnostic check’, we may examine the properties
of the residuals $\hat{a}_t$. Since these are estimates of $a_t$, they should mimic its
behaviour, i.e. they should behave as white noise. The portmanteau
statistics $Q^*$ and $Q$ can be used for this purpose, although the degrees of freedom
attached to them must be amended: if an ARMA($p$,$q$) process is fitted, they
are reduced to $k - p - q$. With $k = 12$, our residuals yield the value
$Q(12) = 7.98$, which is now asymptotically distributed as $\chi^2_{10}$, and hence gives
no evidence of model inadequacy.

An alternative approach to assessing model adequacy is to overfit. For
example, we might consider fitting an AR(3) process or, perhaps, an
ARMA(2,1) to the series. These yield the following pair of models (methods of
estimating MA processes are discussed in, for example, Hamilton, 1994,
chap. 5; we use here conditional least squares)

$$x_t = \underset{(0.019)}{0.032} + \underset{(0.039)}{1.183}\, x_{t-1} - \underset{(0.060)}{0.222}\, x_{t-2} + \underset{(0.039)}{0.004}\, x_{t-3} + \hat{a}_t, \qquad \hat{\sigma} = 0.417$$

$$x_t = \underset{(0.021)}{0.034} + \underset{(0.178)}{1.091}\, x_{t-1} - \underset{(0.174)}{0.129}\, x_{t-2} + \hat{a}_t + \underset{(0.179)}{0.092}\, \hat{a}_{t-1}, \qquad \hat{\sigma} = 0.419$$

In both models, the additional parameter is insignificant, thus confirming
the adequacy of our original choice of an AR(2) process.

Other methods of testing model adequacy are available. In particular, we
may construct formal tests based on the Lagrange multiplier (LM) principle:
see Godfrey (1979), with Mills (1990, chap. 8.8) providing textbook
discussion.

Table 2.3 SACF and SPACF of FTA All Share nominal returns

 k      r_k     s.e.(r_k)   phi_kk    s.e.(phi_kk)
 1     0.105    0.045        0.105    0.045
 2    -0.101    0.046       -0.113    0.045
 3     0.061    0.046        0.087    0.045
 4     0.025    0.047       -0.004    0.045
 5    -0.091    0.048       -0.080    0.045
 6    -0.009    0.048        0.010    0.045
 7     0.024    0.048        0.003    0.045
 8    -0.039    0.048       -0.032    0.045
 9     0.093    0.048        0.112    0.045
10     0.005    0.049       -0.040    0.045
11    -0.037    0.049       -0.006    0.045
12     0.012    0.049        0.006    0.045

Note: phi_kk denotes the sample partial autocorrelation $\hat{\phi}_{kk}$.


Example 2.3 Modelling returns on the FTA All Share index

The broadest-based stock index in the United Kingdom is the Financial
Times-Actuaries (FTA) All Share. Table 2.3 reports the SACF and SPACF
(up to $k = 12$) of its nominal return calculated using equation (1.2) from
monthly observations from 1965 to 2005 ($T = 491$). The portmanteau
statistic is $Q(12) = 23.0$, which is significant at the 0.03 level, and both
$r_k$ and $\hat{\phi}_{kk}$ at lags $k = 1$ and 2 are greater than two standard errors in size.
This suggests that the series is best modelled by some ARMA process of
reasonably low order, although a number of models could be consistent with
the behaviour shown by the SACF and SPACF.

In such circumstances, there are a variety of selection criteria that may be
used to choose an appropriate model, of which perhaps the most popular is
Akaike’s (1974) information criterion (AIC), defined as

$$AIC(p, q) = \ln \hat{\sigma}^2 + 2(p + q) T^{-1}$$

although a criterion that has better theoretical properties is Schwarz’s (1978)

$$BIC(p, q) = \ln \hat{\sigma}^2 + (p + q) T^{-1} \ln T$$

A number of other criteria have been proposed, but all are structured in
terms of the estimated error variance $\hat{\sigma}^2$ plus a penalty adjustment involving
the number of estimated parameters, and it is in the extent of this penalty
that the criteria differ. For more discussion about these, and other, selection
criteria, see Tremayne (2006).

The criteria are used in the following way. Upper bounds, say $p_{\max}$ and
$q_{\max}$, are set for the orders of $\phi(B)$ and $\theta(B)$, and, with $\bar{p} = \{0, 1, \ldots, p_{\max}\}$
and $\bar{q} = \{0, 1, \ldots, q_{\max}\}$, orders $p_1$ and $q_1$ are selected such that, for
example,

$$AIC(p_1, q_1) = \min AIC(p, q), \qquad p \in \bar{p}, \ q \in \bar{q}$$

with parallel strategies obviously being employed in conjunction with BIC or
any other criterion. One possible difficulty with the application of this
strategy is that no specific guidelines on how to determine $\bar{p}$ and $\bar{q}$ seem to be
available, although they are tacitly assumed to be sufficiently large for the
range of models to contain the ‘true’ model, which we may denote as having
orders $(p_0, q_0)$ and which, of course, will not necessarily be the same as
$(p_1, q_1)$, the orders chosen by the criterion under consideration.

Given these alternative criteria, are there reasons for preferring one to
another? If the true orders $(p_0, q_0)$ are contained in the set $(p, q)$,
$p \in \bar{p}$, $q \in \bar{q}$, then, for all criteria, $p_1 \geq p_0$ and $q_1 \geq q_0$, almost surely, as
$T \to \infty$. BIC is strongly consistent, however, in that it determines the true
model asymptotically, whereas for AIC an overparameterised model will
emerge no matter how long the available realisation. Of course, such
properties are not necessarily guaranteed in finite samples, as we find below.
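
The selection strategy can be sketched in a few lines. The code below is a
hedged illustration, not a full model-building routine: it assumes that the
error variance of each candidate ARMA($p$,$q$) model has already been estimated
by some other means, and the variances in `fits` are made-up numbers chosen
only to show how the two penalties can lead to different choices.

```python
import math


def aic(sigma2_hat, p, q, T):
    """AIC(p, q) = ln(sigma2_hat) + 2*(p + q)/T."""
    return math.log(sigma2_hat) + 2.0 * (p + q) / T


def bic(sigma2_hat, p, q, T):
    """BIC(p, q) = ln(sigma2_hat) + (p + q)*ln(T)/T."""
    return math.log(sigma2_hat) + (p + q) * math.log(T) / T


def select_order(sigma2_by_order, T, criterion):
    """Return the (p, q) key that minimises the chosen criterion.

    sigma2_by_order maps (p, q) to the estimated error variance of the
    fitted model; producing those estimates is outside this sketch.
    """
    return min(
        sigma2_by_order,
        key=lambda pq: criterion(sigma2_by_order[pq], pq[0], pq[1], T),
    )


# Hypothetical error variances for three candidate models (made-up numbers)
fits = {(0, 0): 36.5, (0, 2): 35.9, (2, 2): 35.5}
print(select_order(fits, 491, aic), select_order(fits, 491, bic))
```

With these made-up variances AIC selects the largest model while BIC, whose
penalty per parameter is $T^{-1} \ln T$ rather than $2T^{-1}$, selects the smallest.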

Given the behaviour of the SACF and SPACF of our returns series, we set
$p_{\max} = q_{\max} = 2$, and table 2.4 shows the resulting AIC and BIC values. AIC selects
the orders (2,2), i.e. an ARMA(2,2) process, while the orders (0,1) and
(0,2) have identical (to four decimal places) BIC values, so that an MA(2)
process is chosen (these models are shown in bold italics in table 2.4).
The two estimated models are

$$x_t = \underset{(0.74)}{1.81} - \underset{(0.119)}{0.959}\, x_{t-1} - \underset{(0.096)}{0.738}\, x_{t-2} + \hat{a}_t + \underset{(0.120)}{1.083}\, \hat{a}_{t-1} + \underset{(0.103)}{0.742}\, \hat{a}_{t-2}, \qquad \hat{\sigma} = 5.96$$

and

$$x_t = \underset{(0.28)}{1.21} + \hat{a}_t + \underset{(0.045)}{0.130}\, \hat{a}_{t-1} - \underset{(0.045)}{0.107}\, \hat{a}_{t-2}, \qquad \hat{\sigma} = 5.99$$


Table 2.4 Model selection criteria for nominal returns

              q
         p    0        1        2
AIC      0    6.442    6.432    6.423
         1    6.437    6.427    6.429
         2    6.430    6.429    6.417
BIC      0    6.451    6.4488   6.4488
         1    6.454    6.453    6.463
         2    6.456    6.464    6.460

Although these models appear quite different, they are, in fact, similar in
several respects. The estimate of the mean return implied by the ARMA(2,2)
model is 1.21 per cent, the same as that obtained directly from the MA(2)
model, while the sums of the weights of the respective AR($\infty$) representations
are 0.98 and 0.96, respectively. The short-run dynamics are quite similar as
well. For the ARMA(2,2) model the initial weights are $\pi_1 = -0.124$,
$\pi_2 = 0.130$, $\pi_3 = -0.049$ and $\pi_4 = -0.044$, while for the MA(2) they are
$\pi_1 = -0.130$, $\pi_2 = 0.124$, $\pi_3 = -0.030$ and $\pi_4 = -0.017$. Both models
provide acceptable fits to the returns series: the MA(2) has a $Q(6)$ value of 5.2,
with a marginal significance level of 0.27, while the ARMA(2,2) model has a
$Q(6)$ value of 2.0, with a marginal significance level of 0.36.

Thus, although theoretically the BIC has advantages over the AIC, it would
seem that the latter selects the model that is preferable on more general
grounds, given that the AIC-selected model nests the BIC-selected model by
including the two significant autoregressive terms. Nonetheless, we should
observe that, for both criteria, there are other models that yield criterion
values close to that of the model selected. Using this idea of being ‘close to’,
Poskitt and Tremayne (1987) introduce the concept of a model portfolio.
Models are compared to the selected $(p_1, q_1)$ process by way of the statistic,
using AIC for illustration,

$$R = \exp\left[ -\tfrac{1}{2} T \{ AIC(p_1, q_1) - AIC(p, q) \} \right]$$

Although $R$ has no physical meaning, its value may be used to ‘grade the
decisiveness of the evidence’ against a particular model. Poskitt and
Tremayne (1987) suggest that a model with a value of $R$ less than $\sqrt{10}$ may be
thought of as being a close competitor to $(p_1, q_1)$, with the set of closely
competing models being taken as the model portfolio.

Using this concept, with $\sqrt{10}$ taken as an approximate upper bound, the
AIC portfolio contains only the (2,2) model, while the BIC portfolio
contains the (0,0), (0,1), (0,2) and (1,1) models (the model portfolios are shown
in italics in table 2.4: using $\sqrt{10}$ as an upper bound requires a maximum
difference in values of 0.004 here).
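
The portfolios can be recovered from the criterion values in table 2.4. The
following sketch applies the $R$ statistic with $T = 491$ and the $\sqrt{10}$ bound; the
dictionary layout and the function name `model_portfolio` are illustrative
choices, while the numbers are copied from the table.

```python
import math

T = 491
# Criterion values keyed by (p, q), copied from table 2.4
AIC = {(0, 0): 6.442, (0, 1): 6.432, (0, 2): 6.423,
       (1, 0): 6.437, (1, 1): 6.427, (1, 2): 6.429,
       (2, 0): 6.430, (2, 1): 6.429, (2, 2): 6.417}
BIC = {(0, 0): 6.451, (0, 1): 6.4488, (0, 2): 6.4488,
       (1, 0): 6.454, (1, 1): 6.453, (1, 2): 6.463,
       (2, 0): 6.456, (2, 1): 6.464, (2, 2): 6.460}


def model_portfolio(values, T, bound=math.sqrt(10.0)):
    """Orders whose R = exp(-T/2 * (best - value)) lies below the bound."""
    best = min(values.values())
    return sorted(pq for pq, v in values.items()
                  if math.exp(-0.5 * T * (best - v)) < bound)


print(model_portfolio(AIC, T))
print(model_portfolio(BIC, T))
```

The two printed lists contain the (2,2) model alone for AIC and the (0,0),
(0,1), (0,2) and (1,1) models for BIC, matching the portfolios described above.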

All these models have similar fits and, although it is difficult to compare 
them using the estimated AR and MA polynomials, their ‘closeness’ can be 
seen by looking at the roots of the characteristic equations associated with 
the (B) and 0(B) polynomials. The estimated models can be written as 


(0, 0) x = 1.21 + a; 

(0,1) x = 1.21 + (1 + 0.14B)a; 

(0,2) x; = 1.21 + (1 + 0.13B — 0.11B°)a;, 
= 1.21 + (1 + 0.40B)(1 — 0.27B) a; 

(1,1) x = 1.21 - — a a, = 1.21 + (1 + 0.60B)(1 + 0.46B) ‘a, 
= 1.21 + (1 + 0.60B)(1 — 0.46B + 0.46° B? — 0.46° B? + ---)a; 
~ 1.21 + (1 + 0.14B — 0.07B’)a; 

(2,2) ere (1 + 1.083B + 0.742B?) y 

(1 + 0.959B + 0.738B2) 
— į 2] LF (0-54 + 0.671) B)(1 + (0.54 — 0.674) B) r 
(1 + (0.48 + 0.71i)B)(1 + (0.48 — 0.711)B) 


Focusing first on the BIC model portfolio, the (0,1) and (0,0) models are 
obtained from the (0,2) model by successively restricting the small moving 
average coefficients to zero, while the (1,1) model is seen to be closely 
approximated by the (0,2). The AIC-selected (2,2) model looks very
different from the other models, but differences can be deceiving. The
autoregressive and moving average roots have moduli that are almost
identical, 0.859 and 0.861, so that cancellation, or partial cancellation, of
these ‘common factors’ can lead to each of the other models.
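The near-cancellation of the (2,2) factors can be checked numerically. The sketch below uses the rounded coefficients quoted in the text; the helper name `inverse_root_moduli` is introduced here for illustration only.

```python
import numpy as np

# Estimated (2,2) polynomials from the text, coefficients in increasing powers of B
ar = [1.0, 0.959, 0.738]   # phi(B)   = 1 + 0.959B + 0.738B^2
ma = [1.0, 1.083, 0.742]   # theta(B) = 1 + 1.083B + 0.742B^2

def inverse_root_moduli(poly):
    """Moduli of the g_i in the factorisation prod(1 - g_i B) of a lag polynomial."""
    # np.roots expects the highest power first; the roots in B are 1 / g_i
    roots = np.roots(poly[::-1])
    return np.abs(1.0 / roots)

ar_moduli = inverse_root_moduli(ar)   # both approximately 0.859
ma_moduli = inverse_root_moduli(ma)   # both approximately 0.861
```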


2.6 Non-stationary processes and ARIMA models 


The class of ARMA models developed in the previous sections of this chapter 
relies on the assumption that the underlying process is weakly stationary, thus 
implying that the mean, variance and autocovariances of the process are 
invariant under time translations. As we have seen, this restricts the mean and 
variance to be constant and requires the autocovariances to depend only on the 
time lag. Many financial time series are certainly not stationary, however, and, 
in particular, have a tendency to exhibit time-changing means and/or variances.


2.6.1 Non-stationarity in variance


We begin by assuming that a time series can be decomposed into a 
non-stochastic mean level and a random error component 


$$x_t = \mu_t + \varepsilon_t \qquad (2.11)$$

and we suppose that the variance of the errors, $\varepsilon_t$, is functionally
related to the mean level $\mu_t$ by

$$V(x_t) = V(\varepsilon_t) = h^2(\mu_t)\sigma^2$$

where $h(\cdot)$ is some known function. Our objective is to find a transformation
of the data, $g(x_t)$, that will stabilise the variance, i.e. the variance of the
transformed variable $g(x_t)$ should be constant. Expanding $g(x_t)$ as a
first-order Taylor series around $\mu_t$ yields

$$g(x_t) \cong g(\mu_t) + (x_t - \mu_t)g'(\mu_t)$$

where $g'(\mu_t)$ is the first derivative of $g(x_t)$ evaluated at $\mu_t$. The
variance of $g(x_t)$ can then be approximated as

$$V[g(x_t)] \cong V[g(\mu_t) + (x_t - \mu_t)g'(\mu_t)] = [g'(\mu_t)]^2 h^2(\mu_t)\sigma^2$$

Thus, in order to stabilise the variance, we have to choose the transformation
$g(\cdot)$ such that

$$g'(\mu_t) = \frac{1}{h(\mu_t)}$$


For example, if the standard deviation of $x_t$ is proportional to its level,
$h(\mu_t) = \mu_t$, and the variance-stabilising transformation $g(\mu_t)$ has then
to satisfy $g'(\mu_t) = \mu_t^{-1}$. This implies that $g(\mu_t) = \log(\mu_t)$, and
thus (natural) logarithms of $x_t$ should be used to stabilise the variance. If
the variance of $x_t$ is proportional to its level, $h(\mu_t) = \mu_t^{1/2}$, so that
$g'(\mu_t) = \mu_t^{-1/2}$. Thus, since $g(\mu_t) = 2\mu_t^{1/2}$, the square root
transformation $x_t^{1/2}$ will stabilise the variance.


These two examples are special cases of the Box and Cox (1964) class of 
power transformations

$$g(x_t) = \frac{x_t^{\lambda} - 1}{\lambda}$$

where we note that $\lim_{\lambda \to 0}[(x_t^{\lambda} - 1)/\lambda] = \log(x_t)$. While
the use of logarithms is a popular transformation for financial time series, it
is rare for a constant variance to be completely induced by this transformation
alone.
Chapter 5 considers various models in which time-varying variances are 
explicitly modelled. 
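As a small numerical illustration of the power transformation and its logarithmic limit, consider the following sketch. The function name `box_cox` and the sample values are chosen here for illustration and do not come from the text.

```python
import numpy as np

def box_cox(x, lam):
    """Power transformation (x**lam - 1) / lam, with the log limit at lam = 0."""
    x = np.asarray(x, dtype=float)
    if lam == 0:
        return np.log(x)
    return (x**lam - 1.0) / lam

x = np.array([0.5, 1.0, 2.0, 10.0])   # arbitrary positive values
near_log = box_cox(x, 1e-6)           # very close to log(x)
sqrt_case = box_cox(x, 0.5)           # 2 * (sqrt(x) - 1), affine in sqrt(x)
```

Setting λ = 0.5 gives an affine function of the square root, so it stabilises the variance in exactly the same circumstances as the square root transformation discussed above.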


2.6.2 Non-stationarity in mean 


A non-constant mean level in equation (2.11) can be modelled in a variety of 
ways. One possibility is that the mean evolves as a polynomial of order d in 
time. This will arise if x, can be decomposed into a trend component, given 
by the polynomial, and a stochastic, stationary, but possibly autocorrelated, 
zero mean error component. This is always possible given Cramer’s (1961) 
extension of Wold’s decomposition theorem to non-stationary processes. 
Thus, we may have

$$x_t = \mu_t + \varepsilon_t = \sum_{j=0}^{d}\beta_j t^j + \psi(B)a_t \qquad (2.12)$$

Since

$$E(\varepsilon_t) = \psi(B)E(a_t) = 0$$

we have

$$E(x_t) = E(\mu_t) = \sum_{j=0}^{d}\beta_j t^j$$

and, as the $\beta_j$ coefficients remain constant through time, such a trend in the
mean is said to be deterministic. Trends of this type can be removed by a simple
transformation. Consider the linear trend obtained by setting $d = 1$, where, for
simplicity, the error component is assumed to be a white-noise sequence

$$x_t = \beta_0 + \beta_1 t + a_t \qquad (2.13)$$

Lagging (2.13) one period and subtracting this from (2.13) yields


$$x_t - x_{t-1} = \beta_1 + a_t - a_{t-1} \qquad (2.14)$$

The result is a difference equation following an ARMA(1,1) process in which,
since $\phi = \theta = 1$, both autoregressive and moving average roots are unity and
the model is neither stationary nor invertible. If we consider the first
differences of $x_t$, $w_t$ say, then

$$w_t = x_t - x_{t-1} = (1 - B)x_t = \Delta x_t$$

where $\Delta = 1 - B$ is known as the first difference operator. Equation (2.14)
can then be written as

$$w_t = \Delta x_t = \beta_1 + \Delta a_t$$

and $w_t$ is thus generated by a stationary (since $E(w_t) = \beta_1$ is a constant),
but not invertible, MA(1) process.


In general, if the trend polynomial is of order d, and $\varepsilon_t$ is
characterised by the ARMA process $\phi(B)\varepsilon_t = \theta(B)a_t$, then

$$\Delta^d x_t = (1 - B)^d x_t$$

(obtained by differencing $x_t$ d times) will follow the process

$$\Delta^d x_t = \theta_0 + \frac{\Delta^d\theta(B)}{\phi(B)}a_t$$

where $\theta_0 = d!\beta_d$. Thus, the MA part of the process generating
$\Delta^d x_t$ will contain the factor $\Delta^d$ and will therefore have d roots of
unity. Note also that the variance of $x_t$ will be the same as the variance of
$\varepsilon_t$, which will be constant for all t. Figure 2.8 shows plots of
generated data for both linear and quadratic trend models. Because the variance
of the error component, here assumed to be white noise and distributed as
NID(0,9), is constant and independent of the level, the variability of the two
series is bounded about their expected values, and the trend components are
clearly observed in the plots.

An alternative way of generating a non-stationary mean level is to consider 
ARMA models whose autoregressive parameters do not satisfy stationarity 
conditions. For example, consider the AR(1) process

$$x_t = \phi x_{t-1} + a_t \qquad (2.15)$$

where $\phi > 1$. If the process is assumed to have started at time t = 0, the
difference equation (2.15) has the solution

$$x_t = x_0\phi^t + \sum_{i=0}^{t}\phi^i a_{t-i} \qquad (2.16)$$

The ‘complementary function’ $x_0\phi^t$ can be regarded as the conditional
expectation of $x_t$ at time t = 0 (Box and Jenkins, 1976, chap. 4), and is an
increasing function of t. The conditional expectation of $x_t$ at times
$t = 1, 2, \ldots, t-2, t-1$ depends on the random shocks
$a_0, a_1, \ldots, a_{t-3}, a_{t-2}$ and hence, since this conditional expectation
may be regarded as the trend of $x_t$, the trend changes stochastically.

The variance of $x_t$ is given by

$$V(x_t) = \sigma^2\frac{\phi^{2(t+1)} - 1}{\phi^2 - 1}$$

which is an increasing function of time and becomes infinite as $t \to \infty$. In
general, $x_t$ will have a trend in both mean and variance, and such processes
are said to be explosive. A plot of generated data from the process (2.15) with
$\phi = 1.05$ and $a_t \sim NID(0, 9)$, and having starting value $x_0 = 10$, is
shown in figure 2.9. We see that, after a short ‘induction period’, the series
essentially follows an exponential curve, with the generating $a_t$s playing
almost no further part. The same behaviour would be observed if additional
autoregressive and moving average terms were added to the model, as long as the
stationarity conditions are violated.

Figure 2.9 Explosive AR(1) model


As we can see from (2.16), the solution of (2.15) is explosive if $\phi > 1$ but
stationary if $\phi < 1$. The case $\phi = 1$ provides a process that is neatly
balanced between the two. If $x_t$ is generated by the model

$$x_t = x_{t-1} + a_t \qquad (2.17)$$

then $x_t$ is said to follow a random walk. If we allow a constant, $\theta_0$, to be
included, so that

$$x_t = x_{t-1} + \theta_0 + a_t \qquad (2.18)$$

then $x_t$ will follow a random walk with drift. If the process starts at t = 0,
then

$$x_t = x_0 + t\theta_0 + \sum_{i=0}^{t}a_{t-i}$$

so that

$$\mu_t = E(x_t) = x_0 + t\theta_0$$

and

$$\gamma_{k,t} = Cov(x_t, x_{t-k}) = (t - k)\sigma^2, \quad k \ge 0$$

so that the variance is $\gamma_{0,t} = t\sigma^2$. Thus, the correlation between
$x_t$ and $x_{t-k}$ is given by

$$\rho_{k,t} = \frac{t - k}{\sqrt{t(t - k)}} = \sqrt{\frac{t - k}{t}}$$

If t is large compared to k, all $\rho_{k,t}$ will be approximately unity. The
sequence of $x_t$ values will therefore be very smooth, but will also be
non-stationary, since both its mean and variance will increase with t. Figure
2.10 shows generated plots of the random walks (2.17) and (2.18) with $x_0 = 10$
and $a_t \sim NID(0, 9)$. In part (a) of the figure the drift parameter,
$\theta_0$, is set to zero, while in part (b) we have set $\theta_0 = 2$. The two
plots differ considerably, but neither shows any affinity whatsoever with the
initial value $x_0$; indeed, the expected length of time for a random walk to pass
again through an arbitrary value is infinite.
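The random walk correlation $\sqrt{(t - k)/t}$ can be compared with a simulated value. The following is a sketch under stated assumptions: the number of replications, the seed and the choice t = 100, k = 10 are arbitrary, and the walk is built as $x_t = x_0 + \sum_{i=1}^{t} a_i$ so that its variance is exactly $t\sigma^2$.

```python
import numpy as np

def rw_correlation(k, t):
    """Correlation between x_t and x_{t-k} for a random walk started at t = 0."""
    return np.sqrt((t - k) / t)

rng = np.random.default_rng(0)                     # arbitrary seed
shocks = rng.normal(0.0, 3.0, size=(20000, 100))   # a_t ~ NID(0, 9), 20,000 paths
paths = 10.0 + np.cumsum(shocks, axis=1)           # x_0 = 10; column j holds x_{j+1}

t, k = 100, 10
# Cross-sectional correlation between x_100 and x_90 over the simulated paths
simulated = np.corrcoef(paths[:, t - 1], paths[:, t - k - 1])[0, 1]
theoretical = rw_correlation(k, t)                 # sqrt(0.9)
```

Even with k = 10 the theoretical correlation is close to 0.95, which is why random walk realisations look so smooth.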

The random walk is an example of a class of non-stationary processes
known as integrated processes. Equation (2.18) can be written as

$$\Delta x_t = \theta_0 + a_t$$

and so first differencing $x_t$ leads to a stationary model, in this case the
white-noise process $a_t$. Generally, a series may need first differencing d
times to attain stationarity, and the series so obtained may itself be
autocorrelated. If this autocorrelation is modelled by an ARMA(p,q) process,
then the model for the original series is of the form

$$\phi(B)\Delta^d x_t = \theta_0 + \theta(B)a_t \qquad (2.19)$$

which is said to be an autoregressive integrated moving average process of
orders p, d and q, or ARIMA(p,d,q), and $x_t$ is said to be integrated of order d,
denoted I(d).

Figure 2.10(b) $x_t = 2 + x_{t-1} + a_t$, $x_0 = 10$; $a_t \sim NID(0, 9)$ (continued)

It will usually be the case that the order of integration, d, or, equivalently,
the degree of differencing, will be zero, one or, very occasionally, two. Again,
it will be the case that the autocorrelations of an ARIMA process will be near
one for all non-large k. For example, consider the (stationary) ARMA(1,1)
process

$$x_t - \phi x_{t-1} = a_t - \theta a_{t-1}$$

whose ACF has been shown to be

$$\rho_1 = \frac{(1 - \phi\theta)(\phi - \theta)}{1 + \theta^2 - 2\phi\theta}, \qquad \rho_k = \phi\rho_{k-1}, \quad \text{for } k > 1$$

As $\phi \to 1$, the ARIMA(0,1,1) process

$$\Delta x_t = a_t - \theta a_{t-1}$$

results, and all the $\rho_k$ tend to unity.
A number of points concerning the ARIMA class of models are of
importance. Consider again (2.19), with $\theta_0 = 0$ for simplicity:

$$\phi(B)\Delta^d x_t = \theta(B)a_t \qquad (2.20)$$

This process can equivalently be defined by the two equations

$$\phi(B)w_t = \theta(B)a_t \qquad (2.21)$$

and

$$w_t = \Delta^d x_t \qquad (2.22)$$

so that, as we have noted above, the model corresponds to assuming that
$\Delta^d x_t$ can be represented by a stationary and invertible ARMA process.
Alternatively, for $d \ge 1$, (2.22) can be inverted to give

$$x_t = S^d w_t \qquad (2.23)$$

where S is the infinite summation, or integral, operator defined by

$$S = (1 + B + B^2 + \cdots) = (1 - B)^{-1} = \Delta^{-1}$$


Equation (2.23) implies that the process (2.20) can be obtained by summing, 
or ‘integrating’, the stationary process d times: hence the term ‘integrated 
process’. 
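In a finite sample the summation operator amounts to a cumulative sum, provided that the d starting values of the level series are supplied. The sketch below illustrates this for d = 2; the function names `difference` and `integrate` and the data are hypothetical.

```python
import numpy as np

def difference(x, d=1):
    """Apply (1 - B)^d, dropping the d initial observations that are lost."""
    return np.diff(x, n=d)

def integrate(w, initial):
    """Invert `difference`: cumulatively sum w once per element of `initial`.

    `initial` holds the d starting values x_0, ..., x_{d-1} of the level series.
    """
    d = len(initial)
    # First value of each intermediate series: x_0, (Delta x)_1, (Delta^2 x)_2, ...
    starts = [np.diff(initial, n=j)[0] for j in range(d)]
    out = np.asarray(w, dtype=float)
    for start in reversed(starts):
        out = np.concatenate(([start], start + np.cumsum(out)))
    return out

x = np.array([10.0, 10.0, 12.0, 15.0, 13.0, 14.0])   # arbitrary level series
w = difference(x, d=2)                               # second differences
recovered = integrate(w, x[:2])                      # sums twice, recovering x
```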

Box and Jenkins (1976, chap. 4) refer to this type of non-stationary 
behaviour as homogeneous non-stationarity, and it is important to discuss 
why this form of non-stationarity is felt to be useful in describing the 
behaviour of many financial time series. Consider again the first-order
autoregressive process (2.15). A basic characteristic of the AR(1) model is
that, for both $|\phi| < 1$ and $|\phi| > 1$, the local behaviour of a series
generated from the model is heavily dependent upon the level of $x_t$. For many
financial series, local behaviour appears to be roughly independent of level,
and this is what we mean by homogeneous non-stationarity.

If we want to use ARMA models for which the behaviour of the process is
indeed independent of its level, then the autoregressive operator $\phi(B)$ must
be chosen so that

$$\phi(B)(x_t + c) = \phi(B)x_t$$

where c is any constant. Thus,

$$\phi(B)c = 0$$

implying that $\phi(1) = 0$, so that $\phi(B)$ must be able to be factorised as

$$\phi(B) = \phi_1(B)(1 - B) = \phi_1(B)\Delta$$

in which case the class of processes that need to be considered will be of the
form

$$\phi_1(B)w_t = \theta(B)a_t$$

where $w_t = \Delta x_t$. Since the requirement of homogeneous non-stationarity
precludes $w_t$ increasing explosively, either $\phi_1(B)$ is a stationary operator
or $\phi_1(B) = \phi_2(B)(1 - B)$, so that $\phi_2(B)w_t^* = \theta(B)a_t$, where
$w_t^* = \Delta^2 x_t$. Since this argument can be used recursively, it follows
that, for time series that are homogeneously non-stationary, the autoregressive
operator must be of the form $\phi(B)\Delta^d$, where $\phi(B)$ is a stationary
autoregressive operator. Figure 2.11 plots generated data from the model
$\Delta^2 x_t = a_t$, where $a_t \sim NID(0, 9)$ and $x_0 = x_1 = 10$, and such a
series is seen to display random movements in both level and slope.

Figure 2.11 ‘Second difference’ model

We see from figures 2.10(a) and 2.11 that ARIMA models without the
constant $\theta_0$ in (2.19) are capable of representing series that have
stochastic trends, which typically will consist of random changes in both the
level and slope of the series. As seen from figure 2.10(b) and equation (2.18),
however, the inclusion of a non-zero drift parameter introduces a deterministic
trend into the generated series, since $\mu_t = E(x_t) = \beta_0 + \theta_0 t$ if we
set $\beta_0 = x_0$. In general, if a constant is included in the model for dth
differences, then a deterministic polynomial trend of degree d is automatically
allowed for. Equivalently, if $\theta_0$ is allowed to be non-zero, then

$$E(w_t) = E(\Delta^d x_t) = \mu_w = \theta_0/(1 - \phi_1 - \phi_2 - \cdots - \phi_p)$$

is non-zero, so that an alternative way of expressing (2.19) is as

$$\phi(B)\tilde{w}_t = \theta(B)a_t$$

where $\tilde{w}_t = w_t - \mu_w$. Figure 2.12 plots generated data for
$\Delta^2 x_t = 2 + a_t$, where again $a_t \sim NID(0, 9)$ and $x_0 = x_1 = 10$. The
inclusion of the deterministic quadratic trend has a dramatic effect on the
evolution of the series, with the non-stationary ‘noise’ being completely
swamped after a few periods.

Figure 2.12 ‘Second difference with drift’ model

Model (2.19) therefore allows both stochastic and deterministic trends to
be modelled. When $\theta_0 = 0$ a stochastic trend is incorporated, while if
$\theta_0 \ne 0$ the model may be interpreted as representing a deterministic trend
(a polynomial in time of order d) buried in non-stationary noise, which will
typically be autocorrelated. The models presented earlier in this section
could be described as deterministic trends buried in stationary noise, since
they can be written as

$$\phi(B)\Delta^d x_t = \phi(1)\beta_d d! + \Delta^d\theta(B)a_t$$

the stationary nature of the noise in the level of $x_t$ being manifested in d
roots of the moving average operator being unity. Further discussion of the
relationships between stochastic and deterministic trends is contained in
chapter 3.


2.7 ARIMA modelling 


Once the order of differencing d has been established then, since $w_t = \Delta^d x_t$
is by definition stationary, the ARMA techniques discussed in section 2.5.2 
may be applied to the suitably differenced series. Establishing the correct 
order of differencing is by no means straightforward, however, and is 
discussed in detail in chapter 3. We content ourselves here with a sequence of 
examples illustrating the modelling of ARIMA processes when d has already 
been chosen; the suitability of these choices is examined through examples in 
the subsequent chapter. 


Example 2.4 Modelling the UK spread as an integrated process 
In example 2.2 we modelled the spread of UK interest rates as a stationary, 
indeed AR(2), process. Here we consider modelling the spread assuming that 
it is an I(1) process — i.e. we examine the behaviour of the SACF and SPACF 
of $w_t = \Delta x_t$. Table 2.5 provides these estimates up to k = 12 and suggests
that, as both cut off at k = 1, either an AR(1) or an MA(1) process is
identified. Estimation of the former obtains

$$w_t = -\underset{(0.0167)}{0.0045} + \underset{(0.039)}{0.199}\,w_{t-1} + a_t, \qquad \hat{\sigma} = 0.424$$

The residuals are effectively white noise, as they yield a portmanteau statistic
of Q(12) = 9.9, and the mean of $w_t$ is seen to be insignificantly different from
zero. The spread can thus be modelled as an ARIMA(1,1,0) process without
drift. In fact, fitting an ARIMA(0,1,1) process obtained almost identical
estimates, with $\theta$ estimated to be -0.196 and $\hat{\sigma} = 0.423$.

Table 2.5 SACF and SPACF of the first difference of the UK spread

 k     $r_k$    s.e.($r_k$)   $\hat{\phi}_{kk}$   s.e.($\hat{\phi}_{kk}$)
 1     0.198     0.039        0.198       0.039
 2     0.015     0.041       -0.025       0.039
 3    -0.036     0.041       -0.036       0.039
 4     0.022     0.041        0.039       0.039
 5     0.037     0.041        0.026       0.039
 6    -0.005     0.041       -0.020       0.039
 7    -0.022     0.041       -0.016       0.039
 8    -0.081     0.041       -0.074       0.039
 9    -0.087     0.041       -0.062       0.039
10    -0.049     0.042       -0.016       0.039
11    -0.009     0.042        0.000       0.039
12     0.018     0.042        0.020       0.039


Example 2.5 Modelling the dollar/sterling exchange rate 

Figure 2.13 plots daily observations of both the level and first differences of 
the dollar/sterling exchange rate from January 1993 to December 2005, a 
total of 3391 observations. The levels exhibit the wandering movement of a 
driftless random walk: the SACF has $r_1 = 0.997$, $r_{10} = 0.971$, $r_{20} = 0.946$,
$r_{50} = 0.867$ and $r_{100} = 0.752$ and thus displays the slow, almost linear, decline
typical of an I(1) process (this is discussed further in chapter 3). The dif- 
ferences are stationary about zero and appear to show no discernible pattern. 
They are very close to being a white-noise process, the only significant 
sample autocorrelations in the first twenty lags being r;=—0.035, 
$r_{11} = -0.044$ and $r_{15} = -0.047$, the standard error here being 0.017.


Example 2.6 Modelling the FTA All Share index 

Figure 2.14 plots monthly observations from January 1965 to December 2005 
of the FTA All Share index and, as expected, shows the series to exhibit a 
prominent upward, but not linear, trend, with pronounced and persistent 
fluctuations about it, which increase in variability as the level of the series 
increases. This behaviour thus suggests a logarithmic transformation to be 
appropriate. The so transformed observations are also shown in figure 2.14; 
taking logarithms does indeed both linearise the trend and stabilise the 
variance.

Figure 2.13 Dollar/sterling exchange rate (daily January 1993–December 2005)

Table 2.6 SACF and SPACF of the first difference of the FTA All Share index

 k     $r_k$    s.e.($r_k$)   $\hat{\phi}_{kk}$   s.e.($\hat{\phi}_{kk}$)
 1     0.091     0.045        0.091       0.045
 2    -0.091     0.047       -0.100       0.045
 3     0.050     0.047        0.070       0.045
 4     0.036     0.047        0.015       0.045
 5    -0.080     0.047       -0.076       0.045
 6    -0.001     0.048       -0.018       0.045
 7     0.021     0.048        0.001       0.045
 8    -0.038     0.048       -0.032       0.045
 9     0.087     0.048        0.103       0.045
10     0.018     0.048       -0.019       0.045
11    -0.042     0.048       -0.021       0.045
12     0.010     0.048        0.014       0.045

Eliminating the trend by taking first differences yields the SACF and
SPACF shown in table 2.6. Although several low-order $r_k$ and $\hat{\phi}_{kk}$ are
significant, they show no discernible pattern. Using a similar procedure to that
outlined in example 2.3 obtained the following ARMA(2,2) process:

$$\Delta x_t = \underset{(0.74)}{1.81} - \underset{(0.124)}{0.953}\,\Delta x_{t-1} - \underset{(0.100)}{0.756}\,\Delta x_{t-2} + a_t + \underset{(0.125)}{1.062}\,a_{t-1} + \underset{(0.107)}{0.760}\,a_{t-2}, \qquad \hat{\sigma} = 5.75$$

The implied estimate of $\mu$ is 0.67, which, since $\Delta x_t$ can be interpreted
as the monthly growth of the index, implies an annual mean growth rate of
approximately 8 per cent. Here $x_t$ is defined as $100\log(P_t)$, where $P_t$ is the
level of the index. Thus, $\Delta x_t = 100\log(P_t/P_{t-1})$, which can be compared
with the nominal return on the index analysed in example 2.3,

$$r_t \approx \log\left(\frac{P_t}{P_{t-1}}\right) + \log\left(1 + \frac{D_t}{P_t}\right) \approx \Delta x_t + \frac{D_t}{P_t}$$


i.e. the nominal return is equal to the growth of the index plus the dividend 
yield (the sample medians of the three series are 1.5 per cent, 1.1 per cent and 
0.4 per cent per month, respectively). The dividend yield appears to be best 
modelled as an ARMA(1,3) process. Granger and Morris (1976) prove that if 




two independent series $y_1$ and $y_2$ are ARMA($p_i$, $q_i$), i = 1, 2, processes then
their sum is an ARMA(p,q) process, where

$$p \le p_1 + p_2$$

and

$$q \le \max(p_1 + q_2, q_1 + p_2)$$

The assumption of independence can be weakened to allow for contemporaneous
correlation between the innovations of $y_1$ and $y_2$, which is the case for
$\Delta x$ and $D/P$, so that $r_t$ should be generated by an ARMA process with orders
$p < 4$ and $q < 6$, which is consistent with what was found in example 2.3.
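The order bounds are simple enough to encode directly. In the sketch below the function name `arma_sum_orders` is hypothetical, and the inputs are the ARMA(2,2) model for the index growth and the ARMA(1,3) model for the dividend yield mentioned in the text.

```python
def arma_sum_orders(order1, order2):
    """Upper bounds on (p, q) for the sum of two ARMA processes."""
    (p1, q1), (p2, q2) = order1, order2
    return p1 + p2, max(p1 + q2, q1 + p2)

# ARMA(2,2) for the index growth and ARMA(1,3) for the dividend yield
bounds = arma_sum_orders((2, 2), (1, 3))   # (3, 5)
```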


2.8 Seasonal ARIMA modelling 


Seasonalities have an important role to play in the analysis of financial time 
series, with applications ranging from the calendar anomalies literature to 
the intraday regularities observed at the micro-level of high-frequency data. 
ARIMA models offer a general framework for explicitly incorporating 
seasonal effects via seasonal differencing and multiplicative seasonal terms. 
This framework includes traditional seasonal adjustment models, such as the 
widely employed X-11 procedure, as special cases. For an extensive treatment 
of modelling seasonal time series, see, for example, Mills (1990, chap. 10). 
Here the application of seasonal ARIMA modelling is demonstrated through 
two examples. 


Example 2.7 Seasonal differencing of intradaily absolute returns 
High-frequency time series have been used extensively in recent years for 
uncovering intradaily regularities. Figure 2.15 shows the autocorrelation 
function of a series of absolute returns for the General Index of the Athens 
Stock Exchange (GIASE). The data, analysed in Markellos, Mills and Sir- 
iopoulos (2003), consists of 12,117 observations sampled at sixty-second 
intervals over the period 1 June to 10 September 1998, a total of seventy-three 
days. Absolute returns are important, since they can be used as a proxy for 
volatility and risk (see, for example, Granger and Ding, 1995). The auto- 
correlation function of GIASE absolute returns follows a strikingly regular 
U-shape pattern, implying a frequency of 166 observations (one day). 
Variations in the peak size of the U-shapes suggest the possibility of additional
day-of-the-week effects. Similar patterns have been reported for several other
markets, and a number of approaches have been proposed for deseasonalising
intraday volatility, including seasonal multipliers, time-invariant polynomial
approximations of market activity and smooth harmonic and polynomial
function approximations. As shown in figure 2.16, most of the seasonality in
volatility is removed if we take 166-lag or one-day differences.

Figure 2.15 Autocorrelation function of the absolute returns of the GIASE (intradaily, 1 June–10 September 1998)


Example 2.8 Seasonal ARIMA modelling of spot energy price returns 
Energy series are notorious for their strong seasonal components. Figure 2.17 
shows average one-day-ahead spot electricity prices and logarithmic returns 
from the Nord Pool Exchange. The data cover the period 22 March 2002 to 
3 December 2004 (T= 988) and are expressed in €/MWh. The plot of prices 
indicates random walk behaviour with abrupt and violent changes and 
spikes. Returns appear stationary, at least with respect to the first moment. 
The SACF and SPACF, shown in table 2.7, display weekly seasonality, with 
large positive autocorrelations appearing at the seasonal lags (7k, k ≥ 1).

Table 2.7 SACF and SPACF of Nord Pool spot electricity price returns

 k     $r_k$    s.e.($r_k$)   $\hat{\phi}_{kk}$   s.e.($\hat{\phi}_{kk}$)    Q(k)
 1    -0.049     0.032       -0.049       0.032        2.34
 2    -0.180     0.032       -0.183       0.032       34.46
 3    -0.082     0.033       -0.105       0.032       41.14
 4    -0.084     0.033       -0.136       0.032       48.12
 5    -0.167     0.033       -0.235       0.032       75.94
 6     0.031     0.034       -0.071       0.032       76.89
 7     0.432     0.034        0.361       0.032      262.99
 8     0.079     0.039        0.138       0.032      269.18
 9    -0.213     0.040       -0.107       0.032      314.58
10    -0.077     0.041       -0.047       0.032      320.57
11    -0.039     0.041       -0.005       0.032      322.11
12    -0.174     0.041       -0.107       0.032      352.45
13     0.020     0.042       -0.055       0.032      352.85
14     0.376     0.042        0.166       0.032      494.55

Figure 2.16 Autocorrelation function of the seasonally differenced absolute returns of the GIASE (intradaily, 1 June–10 September 1998)

Figure 2.17 Nord Pool spot electricity prices and returns (daily averages, 22 March 2002–3 December 2004)

Although seasonally differenced models can be incorporated into the
ARIMA framework, the identification of models of this form can lead to a
large number of parameters having to be fitted and may result in a model
being difficult to interpret. Box and Jenkins (1976, chap. 9) recommend
the estimation of the more restricted multiplicative ARIMA model,
which includes seasonal autoregressive (SAR) and seasonal moving average
(SMA) terms. The purpose of these models is to allow the formation of
products of lag polynomials. For example, estimating an MA(2) model for
returns yields

$$\Delta x_t = (1 - 0.129B - 0.248B^2)a_t$$

Including a multiplicative seasonal MA(7) term produces

$$\Delta x_t = (1 - 0.156B - 0.166B^2)(1 + 0.305B^7)a_t$$
$$= (1 - 0.156B - 0.166B^2 + 0.305B^7 - 0.047B^8 - 0.051B^9)a_t$$

The estimated model is an MA(9) process with non-linear restrictions on the
coefficients. The best fit to the series was given by

$$(1 - \underset{(0.006)}{0.982}B^7)\Delta x_t = (1 - \underset{(0.032)}{0.110}B - \underset{(0.031)}{0.090}B^2)(1 - \underset{(0.018)}{0.889}B^7)a_t$$

which is referred to as a multiplicative ARIMA(0,1,2)(1,0,1)$_7$ process.
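The expansion of a multiplicative seasonal MA polynomial is a convolution of coefficient sequences, which the following sketch illustrates using the rounded estimates of the MA(2) and seasonal MA(7) factors quoted above. The variable names are chosen here for illustration.

```python
import numpy as np

# Coefficients in increasing powers of B, taken from the estimated model in the text
regular = np.array([1.0, -0.156, -0.166])   # 1 - 0.156B - 0.166B^2
seasonal = np.zeros(8)
seasonal[0], seasonal[7] = 1.0, 0.305       # 1 + 0.305B^7

# Multiplying lag polynomials is a convolution of their coefficient sequences
expanded = np.convolve(regular, seasonal)   # coefficients of B^0, ..., B^9
```

The coefficients on $B^8$ and $B^9$ are the products of the regular and seasonal coefficients, which is the non-linear restriction referred to in the text; only three parameters are estimated, although the implied process is an MA(9).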


2.9 Forecasting using ARIMA models 


Given a realisation $\{x_t\}_{1-d}^{T}$ from a general ARIMA(p,d,q) process

$$\phi(B)\Delta^d x_t = \theta_0 + \theta(B)a_t$$

it is often the case that we wish to forecast a future value $x_{T+h}$. If we let

$$\alpha(B) = \phi(B)\Delta^d = (1 - \alpha_1 B - \alpha_2 B^2 - \cdots - \alpha_{p+d}B^{p+d})$$

then a minimum mean square error (MMSE) forecast, denoted $f_{T,h}$, made at
time T, is given by the conditional expectation

$$f_{T,h} = E(\alpha_1 x_{T+h-1} + \alpha_2 x_{T+h-2} + \cdots + \alpha_{p+d}x_{T+h-p-d} + \theta_0 + a_{T+h} - \theta_1 a_{T+h-1} - \cdots - \theta_q a_{T+h-q} \mid x_T, x_{T-1}, \ldots)$$

Now

$$E(x_{T+j} \mid x_T, x_{T-1}, \ldots) = \begin{cases} x_{T+j}, & j \le 0 \\ f_{T,j}, & j > 0 \end{cases}$$

and

$$E(a_{T+j} \mid x_T, x_{T-1}, \ldots) = \begin{cases} a_{T+j}, & j \le 0 \\ 0, & j > 0 \end{cases}$$

so that, to evaluate $f_{T,h}$, all we need to do is (i) replace past expectations
($j \le 0$) by known values, $x_{T+j}$ and $a_{T+j}$, and (ii) replace future
expectations ($j > 0$) by forecast values, $f_{T,j}$ and zero.

Three examples will illustrate the procedure. Consider first the AR(2)
model $(1 - \phi_1 B - \phi_2 B^2)x_t = \theta_0 + a_t$, so that
$\alpha(B) = (1 - \phi_1 B - \phi_2 B^2)$. Here

$$x_{T+h} = \phi_1 x_{T+h-1} + \phi_2 x_{T+h-2} + \theta_0 + a_{T+h}$$

and hence, for h = 1, we have

$$f_{T,1} = \phi_1 x_T + \phi_2 x_{T-1} + \theta_0$$

for h = 2

$$f_{T,2} = \phi_1 f_{T,1} + \phi_2 x_T + \theta_0$$

and for h > 2

$$f_{T,h} = \phi_1 f_{T,h-1} + \phi_2 f_{T,h-2} + \theta_0$$

An alternative expression for $f_{T,h}$ can be obtained by noting that

$$f_{T,h} = (\phi_1 + \phi_2)f_{T,h-1} - \phi_2(f_{T,h-1} - f_{T,h-2}) + \theta_0$$

from which, by repeated substitution, we obtain

$$f_{T,h} = (\phi_1 + \phi_2)^h x_T - \phi_2\sum_{j=0}^{h-1}(\phi_1 + \phi_2)^j(f_{T,h-1-j} - f_{T,h-2-j}) + \theta_0\sum_{j=0}^{h-1}(\phi_1 + \phi_2)^j$$

where, by convention, we take $f_{T,0} = x_T$ and $f_{T,-1} = x_{T-1}$. Thus, for
stationary processes ($\phi_1 + \phi_2 < 1$, $|\phi_2| < 1$), as $h \to \infty$,

$$f_{T,h} \to \frac{\theta_0}{1 - \phi_1 - \phi_2} = E(x_t) = \mu$$
so that for large lead times the best forecast of a future observation is 
eventually the mean of the process. 
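The AR(2) recursion can be sketched as follows. The function name `ar2_forecasts` is hypothetical, and the parameter values and final observations are those reported for the UK spread in example 2.9 later in this section.

```python
def ar2_forecasts(phi1, phi2, theta0, x_T, x_Tm1, horizon):
    """MMSE forecasts f_{T,1}, ..., f_{T,horizon} from an AR(2) with constant theta0."""
    prev2, prev1 = x_Tm1, x_T        # conventions f_{T,-1} = x_{T-1}, f_{T,0} = x_T
    forecasts = []
    for _ in range(horizon):
        f = phi1 * prev1 + phi2 * prev2 + theta0
        forecasts.append(f)
        prev2, prev1 = prev1, f      # forecasts replace unknown future values
    return forecasts

f = ar2_forecasts(1.180, -0.216, 0.032, x_T=-0.34, x_Tm1=-0.21, horizon=200)
limit = 0.032 / (1 - 1.180 + 0.216)  # theta0 / (1 - phi1 - phi2)
```

The first three forecasts are approximately -0.324, -0.277 and -0.225, and after 200 steps the forecast is very close to the limit implied by these rounded coefficients.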

Next consider the ARIMA(0,1,1) model $\Delta x_t = (1 - \theta B)a_t$. Here
$\alpha(B) = (1 - B)$ and so

$$x_{T+h} = x_{T+h-1} + a_{T+h} - \theta a_{T+h-1}$$

For h = 1 we have

$$f_{T,1} = x_T - \theta a_T$$

for h = 2

$$f_{T,2} = f_{T,1} = x_T - \theta a_T$$

and, in general,

$$f_{T,h} = f_{T,h-1}, \quad h > 1$$

Thus, for all lead times, the forecasts from origin T will follow a straight line
parallel to the time axis and passing through $f_{T,1}$. Note that, since

$$f_{T,h} = x_T - \theta a_T$$

and

$$a_T = (1 - B)(1 - \theta B)^{-1}x_T$$

the h-step ahead forecast can be written as

$$f_{T,h} = (1 - \theta)(1 - \theta B)^{-1}x_T = (1 - \theta)(x_T + \theta x_{T-1} + \theta^2 x_{T-2} + \cdots)$$

i.e. the forecast for all future values of x is an exponentially weighted moving
average of current and past values.
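The equivalence between the ARIMA(0,1,1) forecast and an exponentially weighted moving average can be checked numerically. This is a sketch under stated assumptions: the function names are hypothetical, the residual recursion is started from an assumed value $a_0 = 0$, and the infinite sum is truncated at the start of the sample, so the two expressions agree up to a term of order $\theta^{T+1}$.

```python
import numpy as np

def ima11_forecast(x, theta):
    """Forecast x_T - theta * a_T, with residuals from a_t = (x_t - x_{t-1}) + theta * a_{t-1}."""
    a = 0.0                                   # assumed start-up value a_0 = 0
    for t in range(1, len(x)):
        a = x[t] - x[t - 1] + theta * a
    return x[-1] - theta * a

def ewma_forecast(x, theta):
    """(1 - theta) * sum_j theta^j x_{T-j}, truncated at the start of the sample."""
    weights = (1.0 - theta) * theta ** np.arange(len(x))
    return float(np.dot(weights, x[::-1]))

rng = np.random.default_rng(1)                # arbitrary seed
x = np.cumsum(rng.normal(size=300))           # any series will do for the identity
gap = abs(ima11_forecast(x, 0.5) - ewma_forecast(x, 0.5))
```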
Finally, consider the ARIMA(0,2,2) model $\Delta^2 x_t = (1 - \theta_1 B - \theta_2 B^2)a_t$,
with $\alpha(B) = (1 - B)^2 = (1 - 2B + B^2)$:

$$x_{T+h} = 2x_{T+h-1} - x_{T+h-2} + a_{T+h} - \theta_1 a_{T+h-1} - \theta_2 a_{T+h-2}$$

For h = 1 we have

$$f_{T,1} = 2x_T - x_{T-1} - \theta_1 a_T - \theta_2 a_{T-1}$$

for h = 2

$$f_{T,2} = 2f_{T,1} - x_T - \theta_2 a_T$$

for h = 3

$$f_{T,3} = 2f_{T,2} - f_{T,1}$$

and thus, for $h \ge 3$,

$$f_{T,h} = 2f_{T,h-1} - f_{T,h-2}$$

Hence, for all lead times, the forecasts from origin T will follow a straight line
passing through the forecasts $f_{T,1}$ and $f_{T,2}$.
The h-step ahead forecast error for origin T is

$$e_{T,h} = x_{T+h} - f_{T,h} = a_{T+h} + \psi_1 a_{T+h-1} + \cdots + \psi_{h-1}a_{T+1}$$

where $\psi_1, \ldots, \psi_{h-1}$ are the first h - 1 $\psi$-weights in
$\psi(B) = \alpha^{-1}(B)\theta(B)$. The variance of this forecast error is then

$$V(e_{T,h}) = \sigma^2(1 + \psi_1^2 + \psi_2^2 + \cdots + \psi_{h-1}^2) \qquad (2.24)$$

The forecast error is therefore a linear combination of the unobservable
future shocks entering the system after time T and, in particular, the one-step
ahead forecast error is

$$e_{T,1} = x_{T+1} - f_{T,1} = a_{T+1}$$


Thus, for an MMSE forecast, the one-step ahead forecast errors must be 
uncorrelated. h-step ahead forecasts made at different origins will not be 
uncorrelated, however, and neither will be forecasts for different lead 
times made at the same origin (see, for example, Box and Jenkins, 1976, 
appendix A5.1). 

For the AR(2) example given above, we have $\psi_1 = \phi_1$, $\psi_2 = \phi_1^2 + \phi_2$
and, for j > 2, $\psi_j = \phi_1\psi_{j-1} + \phi_2\psi_{j-2}$. Since we are assuming
stationarity, these $\psi$-weights converge absolutely. As absolute convergence
($\sum_j |\psi_j| < \infty$) implies $\sum_j \psi_j^2 < \infty$, known as
square-summability (Hamilton, 1994, chap. 3), $V(e_{T,h})$ converges to a finite
value, which is the variance of the process about the ultimate forecast $\mu$.

For the ARIMA(0,1,1) model, $\psi_j = 1 - \theta$, j = 1, 2, .... Thus we have

$$V(e_{T,h}) = \sigma^2(1 + (h - 1)(1 - \theta)^2)$$

which increases with h. Similarly, the ARIMA(0,2,2) model has $\psi$-weights
given by $\psi_j = 1 + \theta_2 + j(1 - \theta_1 - \theta_2)$, j = 1, 2, ..., and an
h-step ahead forecast error variance of

$$V(e_{T,h}) = \sigma^2\left[1 + (h - 1)(1 + \theta_2)^2 + \tfrac{1}{6}h(h - 1)(2h - 1)(1 - \theta_1 - \theta_2)^2 + h(h - 1)(1 + \theta_2)(1 - \theta_1 - \theta_2)\right]$$


which again increases with h. 

The examples in this section thus show how the degree of differencing, or 
order of integration, determines not only how successive forecasts are related 
to each other but also the behaviour of the associated error variances. 


Example 2.9 ARIMA forecasting of financial time series 
Here we examine the properties of ARIMA forecasts for some of the series 
analysed in the examples of this chapter. 
Example 2.2 fitted an AR(2) model to the UK interest rate spread, yielding
parameter estimates $\hat{\phi}_1 = 1.180$, $\hat{\phi}_2 = -0.216$,
$\hat{\theta}_0 = 0.032$ and $\hat{\sigma} = 0.419$. With the last two observations
being $x_{T-1} = -0.21$ and $x_T = -0.34$, forecasts are obtained as

$$f_{T,1} = 1.180x_T - 0.216x_{T-1} + 0.032 = -0.324$$
$$f_{T,2} = 1.180f_{T,1} - 0.216x_T + 0.032 = -0.277$$
$$f_{T,3} = 1.180f_{T,2} - 0.216f_{T,1} + 0.032 = -0.225$$

and so on. As h increases, the forecasts eventually tend to 0.963, the
sample mean of the spread, although the large autoregressive root makes
this convergence to the sample mean rather slow. The $\psi$-weights are given by

$$\psi_1 = \phi_1 = 1.180$$
$$\psi_2 = \phi_1^2 + \phi_2 = 1.176$$
$$\psi_3 = \phi_1^3 + 2\phi_1\phi_2 = 1.133$$
$$\psi_4 = \phi_1^4 + 3\phi_1^2\phi_2 + \phi_2^2 = 1.083$$

and, hence,

$$\psi_h = 1.180\psi_{h-1} - 0.216\psi_{h-2}$$

The forecast error variances are

$$V(e_{T,1}) = 0.419^2 = 0.176$$
$$V(e_{T,2}) = 0.419^2(1 + 1.180^2) = 0.421$$
$$V(e_{T,3}) = 0.419^2(1 + 1.180^2 + 1.176^2) = 0.664$$
$$V(e_{T,4}) = 0.419^2(1 + 1.180^2 + 1.176^2 + 1.133^2) = 0.890$$

the forecast error variances eventually converging to the sample variance of
the spread, 4.09.
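The ψ-weight recursion and the forecast error variances of formula (2.24) can be reproduced with the short sketch below, which uses the rounded AR(2) estimates for the spread. The function names are hypothetical, and because the inputs are rounded the results may differ from the quoted figures in the third decimal place.

```python
def ar2_psi_weights(phi1, phi2, n):
    """psi_0, ..., psi_n from psi_j = phi1 * psi_{j-1} + phi2 * psi_{j-2}."""
    psi = [1.0, phi1]
    while len(psi) <= n:
        psi.append(phi1 * psi[-1] + phi2 * psi[-2])
    return psi[: n + 1]

def forecast_error_variances(psi, sigma, horizon):
    """V(e_{T,h}) = sigma^2 * (1 + psi_1^2 + ... + psi_{h-1}^2) for h = 1..horizon."""
    out, total = [], 0.0
    for h in range(1, horizon + 1):
        total += psi[h - 1] ** 2     # adds psi_{h-1}^2, with psi_0 = 1
        out.append(sigma**2 * total)
    return out

psi = ar2_psi_weights(1.180, -0.216, 4)
variances = forecast_error_variances(psi, sigma=0.419, horizon=4)
```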

If, however, we use the ARIMA(0,1,1) process of example 2.4 to model the
spread, with $\hat{\theta} = 0.196$ and $\hat{\sigma} = 0.423$ (and setting the drift
to zero), then our forecasts are (using the final residual $\hat{a}_T = -0.10$)

$$f_{T,1} = -0.34 - 0.196(-0.10) = -0.320$$

and, for h > 1,

$$f_{T,h} = f_{T,1} = -0.320$$

so that there is no tendency for the forecasts to converge to the sample mean
or, indeed, to any other value. Furthermore, the forecast error variances are
given by

$$V(e_{T,h}) = 0.423^2(1 + 0.646(h - 1)) = 0.179 + 0.116(h - 1)$$


which, of course, increase with h rather than tending to a constant. This 
example thus illustrates, within the forecasting context, the radically different 
properties of ARMA models that have, on the one hand, a unit autoregressive 
root and, on the other, a root that is large but less than unity. 
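
A companion sketch for the ARIMA(0,1,1) case makes the contrast concrete: the forecast is flat while the error variance grows linearly in $h$. It assumes the model is written as $\Delta x_t = (1 - \theta B)a_t$, which is the reading consistent with the arithmetic in the text; the names are illustrative.

```python
# Rounded estimates quoted in the text for the ARIMA(0,1,1) model.
THETA, SIGMA = 0.196, 0.423
X_LAST, A_LAST = -0.34, -0.10  # x_T and the final residual


def ima11_forecast(h):
    """Forecast at any horizon h >= 1: the same value for every h."""
    return X_LAST - THETA * A_LAST


def ima11_forecast_error_variance(h):
    """V(e_{T,h}) = sigma^2 * (1 + (1 - theta)^2 * (h - 1))."""
    return SIGMA ** 2 * (1.0 + (1.0 - THETA) ** 2 * (h - 1))


if __name__ == "__main__":
    for h in (1, 2, 5, 10):
        print(h, round(ima11_forecast(h), 3),
              round(ima11_forecast_error_variance(h), 3))
```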

The dollar/sterling exchange rate was found, in example 2.4, effectively to 
be a driftless random walk, which therefore implies, given an end-of-sample 
exchange rate of 1.719, that all future forecasts of the rate are that particular 
value, although the precision of these forecasts, as measured by the accompanying
forecast error variance, diminishes as the forecasting horizon increases: with
$\sigma$ estimated to be 0.008, we have $V(e_{T,h}) = 0.000064h$.

In example 2.6 we modelled the logarithms of the FTA All Share index as 
an ARIMA(2,1,2) process. Since 


$$\phi(B) = 1 + 0.953B + 0.756B^2$$

we have

$$\alpha(B) = 1 - 0.047B - 0.197B^2 - 0.756B^3$$

so that forecasts can be computed recursively by

$$f_{T,1} = 0.047x_T + 0.197x_{T-1} + 0.756x_{T-2} + 1.81 + 1.062a_T + 0.760a_{T-1}$$
$$f_{T,2} = 0.047f_{T,1} + 0.197x_T + 0.756x_{T-1} + 1.81 + 0.760a_T$$
$$f_{T,3} = 0.047f_{T,2} + 0.197f_{T,1} + 0.756x_T + 1.81$$
$$f_{T,4} = 0.047f_{T,3} + 0.197f_{T,2} + 0.756f_{T,1} + 1.81$$

and, for $h \geq 5$,

$$f_{T,h} = 0.047f_{T,h-1} + 0.197f_{T,h-2} + 0.756f_{T,h-3} + 1.81$$


By computing the coefficients in the polynomial $\psi(B) = \alpha^{-1}(B)\theta(B)$ as

$$\psi(B) = 1 + 1.109B + 1.009B^2 + 1.022B^3 + 1.084B^4 + \cdots \qquad (2.25)$$

and using the estimate $\hat\sigma = 0.0575$, forecast error variances can then be
computed using the formula (2.24); since the series is I(1), these variances 
increase with h. 
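
The coefficients in (2.25) follow from matching powers of $B$ in $\alpha(B)\psi(B) = \theta(B)$. The sketch below carries out that recursion for the rounded coefficients quoted above. The helper names are hypothetical, and the variance function simply applies $\sigma^2(1 + \psi_1^2 + \cdots + \psi_{h-1}^2)$, which is how formula (2.24) is read here.

```python
def psi_from_arma(alpha, theta, n):
    """Return [psi_0, ..., psi_n] for psi(B) = theta(B) / alpha(B).

    alpha holds (alpha_1, alpha_2, ...) in alpha(B) = 1 - alpha_1 B - ...
    theta holds (theta_1, theta_2, ...) in theta(B) = 1 + theta_1 B + ...
    """
    psi = [1.0]
    for j in range(1, n + 1):
        value = theta[j - 1] if j <= len(theta) else 0.0
        for i, coeff in enumerate(alpha, start=1):
            if j - i >= 0:
                value += coeff * psi[j - i]
        psi.append(value)
    return psi


def forecast_error_variance(psi, sigma, h):
    """sigma^2 * (psi_0^2 + ... + psi_{h-1}^2), with psi_0 = 1."""
    return sigma ** 2 * sum(p * p for p in psi[:h])


if __name__ == "__main__":
    # Rounded coefficients of the ARIMA(2,1,2) model quoted in the text.
    weights = psi_from_arma([0.047, 0.197, 0.756], [1.062, 0.760], 12)
    print([round(w, 3) for w in weights[:5]])
    print([round(forecast_error_variance(weights, 0.0575, h), 5) for h in (1, 4, 12)])
```

Because the series is I(1), the $\psi$-weights do not die out, so the variance keeps growing with $h$.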

Additional interpretation of the nature of these forecasts is provided by 
the eventual forecast function, which is obtained by solving the difference 
equation implicit in the ARIMA(2,1,2) representation of $x_t$ at time $T+h$
(see, for example, Mills, 1990, chap. 7.3, for a general development, and 
McKenzie, 1988, for further discussion): 


$$x_{T+h} - 0.047x_{T+h-1} - 0.197x_{T+h-2} - 0.756x_{T+h-3} = 1.81 + a_{T+h} + 1.062a_{T+h-1} + 0.760a_{T+h-2}$$


At origin T, this difference equation has the solution 


$$x_{T+h} = \sum_{i=1}^{3} b_i^{(T)} f_i(h) + 0.640\sum_{j=T+1}^{T+h}\psi_{T+h-j} + \sum_{j=T+1}^{T+h}\psi_{T+h-j}a_j$$

where the $\psi$s are as in (2.25) and the functions $f_1(h),\ldots,f_3(h)$ depend upon
the roots of the polynomial $\alpha(B)$, which are unity and the pair of complex
roots $-0.48 \pm 0.71i$. Hence, the solution can be written as

$$x_{T+h} = b_0 + b_1^{(T)} + b_2^{(T)}(-0.48 + 0.71i)^h + b_3^{(T)}(-0.48 - 0.71i)^h$$

where

$$b_0 = 0.640\sum_{j=T+1}^{T+h}\psi_{T+h-j} + \sum_{j=T+1}^{T+h}\psi_{T+h-j}a_j$$

For a given origin $T$, the coefficients $b_j^{(T)}$, $j = 1,\ldots,3$, are constants
applying to all lead times $h$, but they change from one origin to the next,
adapting themselves to the observed values of $x_t$. They can be obtained by
solving a set of recursive equations containing the $f_i(h)$s, $\psi_h$ and $a_T$.

Since the $\psi$s increase with $h$, $b_0$ imparts a deterministic drift into $x_{T+h}$,
so that $b_0 + b_1^{(T)}$ gives the forecasted ‘trend’ of the series. Around this trend is
a damped sine wave provided by the pair of complex roots, its damping 
factor, frequency and phase being functions of the process parameters (Box 
and Jenkins, 1976, pp. 58-63). These complex roots provide a damped 
cyclical AR(2) component with a damping factor of 0.87 and an average 
period of approximately three months. 


Univariate linear stochastic models: testing for unit roots and alternative trend specifications


One of the earliest problems that faced time series analysts was the modelling 
of long-term persistence, or trends, in the observed data. A major motivation 
for studying the trend properties of time series was the belief that long-term 
components should be removed in order to uncover any remaining short-term
regularities. Until the 1980s the dominant view was that these properties
could be well described by deterministic functions of a time index.
This approach was pioneered by Jevons in the mid-nineteenth century 
and was popularised fifty years later by Persons with the celebrated ‘Harvard 
A-B-C curve’ methodology of stock market prediction (see Samuelson, 1987, 
and Morgan, 1990). At present, the dominant paradigm in economic and 
financial time series modelling builds upon the random walk model, first 
introduced into finance by Bachelier (1900), where, as we have seen in 
chapter 2, stochastic trends arise from the accumulation of random changes. 

These two approaches constitute the main apparatus for the analysis of 
non-stationary time series — i.e. of time series that, broadly speaking, wander 
without bound and origin and do not have well-defined or finite unconditional
moments. The first approach deals with trend stationary processes, a
class of non-autonomous processes, which can be made stationary by 
removing a deterministic trend in the form of a polynomial in time. The 
second approach deals with random walks, which are produced from the 
accumulation (or integration in continuous time) of white-noise random 
variables. Random walks belong to the more general class of unit root or 
integrated processes that arise from the integration of stochastic variables. 
Unit root processes, symbolised as I($d$), where $d$ denotes the order of integration,
can be transformed to stationarity by taking $d$ successive differences;
they are therefore also known as difference stationary processes. ‘Mixed’ trend/ 
difference stationary processes can be obtained by imposing deterministic 
time trends (drifts) on random walks. Knowing whether the non-stationarity 
in financial data is due to a time trend or a unit root is very important for
choosing the appropriate transformation and statistical framework (for a
comparison of the two processes, see Hamilton, 1994, chap. 15). In general, if
there is no strong presumption in favour of trend stationarity then the
random walk should be chosen, since it is likely to cause far fewer econometric
problems (see Mills, 1990, chap. 11).

A considerable amount of research on random walks in finance, especially
in option pricing theory, is undertaken in continuous time using the
framework of stochastic differential equations and stochastic calculus. The
continuous-time counterparts of random walks are called Brownian motions
or Wiener processes, which belong to the family of diffusions. Drifting
random walks are called generalised Brownian motions/Wiener processes.
An important generalisation is the class of Itô processes, where the expected
drift and variance of a generalised Wiener process change as a function of the
underlying stochastic variable and time. An important result of Itô, known as
Itô’s lemma, allows us to calculate explicitly the stochastic differential
equation that governs arbitrary non-linear functions of an Itô process and
time (see, for example, Rogers, 1997).

Although some early studies had provided informal evidence in support of 
the random walk model for economic and financial time series, such as 
Working (1934) for stock and commodity prices and Slutsky (1937) for 
business cycles, it was not until Kendall’s (1953) investigation that formal
statistical evidence was provided to suggest that changes in financial prices 
were unpredictable on the basis of past information. Much subsequent 
research has been devoted to empirical investigations of issues related to 
random walks and predictability. It has been widely reported that returns on 
financial prices are essentially linearly unpredictable on the basis of historical 
information, which is consistent with an early version of the efficient market 
hypothesis in finance (e.g. see Cootner, 1964). The random walk has gained 
its present popularity on the basis of theoretical rather than empirical 
arguments, however. More specifically, it was demonstrated that random 
walks were the empirical consequence of many important theoretical models 
from the rational expectations family (e.g. with respect to financial market 
efficiency, hysteresis models of unemployment, the permanent consumption 
hypothesis, etc.). A considerable amount of research has since been devoted 
to exploring the validity and consequences of random walks in finance under 
rational expectations and market efficiency: for example, futures and stock 
prices (Samuelson, 1965, 1973), dividends and earnings (Kleidon, 1986a), 
spot and forward exchange rates (Meese and Singleton, 1982) and interest 
rates (Campbell and Shiller, 1987). This research has also concentrated on
examining the validity of the stationarity assumptions made by rational
expectations models when solving for the expected future values of fundamental
variables. A very important reason for the popularity of the random
walk model, especially in finance, was the fact that, unlike the trend stationary
process, it was consistent with the intuition that uncertainty
increases the further we look into the future, i.e. that risk increases with the
investment horizon.

The random walk debate placed many doubts upon empirical and theoretical
work based on business cycle theories that distinguished between
cyclical components and the determinants of the trend rate of growth. It was
also demonstrated that ‘pure’ random walk behaviour was neither a necessary
nor a sufficient condition for rationally determined financial prices (e.g.
LeRoy, 1973; Lucas, 1978) and that theoretical models not based on random 
walks may be better approximations than the standard rational expectations 
hypothesis (e.g. the noise trading model of Frankel and Froot, 1988). 
Nevertheless, the majority of theoretical models in the past thirty years, 
especially in finance, have been based on some variation of the random walk, 
rather than on a trend stationary model. 

Notwithstanding this theoretical preference, the focus of this chapter is on 
developing a set of techniques for determining the order of integration of a 
time series. Throughout this development, we emphasise the importance of 
the chosen alternative hypothesis to the null of a unit root, in particular 
whether the alternative is that of a constant mean, a linear trend, a segmented 
trend or a non-linear trend. The importance of these models to finance is 
demonstrated through a sequence of examples. 


3.1 Determining the order of integration of a time series 


As shown in the previous chapter, the order of integration, d, is a crucial 
determinant of the properties that a time series exhibits. If we restrict ourselves 
to the most common values of zero and one for $d$, so that $x_t$ is either I(0) or
I(1), then it is useful to bring together the properties of such processes.

If $x_t$ is I(0), which we will denote $x_t \sim$ I(0) even though such a notation has
been used previously to denote the distributional characteristics of a series,
then, assuming for convenience that it has zero mean,

(i) the variance of $x_t$ is finite and does not depend on $t$;
(ii) the innovation $a_t$ has only a temporary effect on the value of $x_t$;
(iii) the expected length of time between crossings of $x = 0$ is finite, i.e. $x_t$
fluctuates around its mean of zero; and
(iv) the autocorrelations, $\rho_k$, decrease steadily in magnitude for large
enough $k$, so that their sum is finite.

If $x_t \sim$ I(1) with $x_0 = 0$, then

(i) the variance of $x_t$ goes to infinity as $t$ goes to infinity;
(ii) an innovation $a_t$ has a permanent effect on the value of $x_t$ because $x_t$ is
the sum of all previous innovations: see, e.g., equation (2.16);
(iii) the expected time between crossings of $x = 0$ is infinite; and
(iv) the autocorrelations $\rho_k \to 1$ for all $k$ as $t$ goes to infinity.

The fact that a time series is non-stationary is often self-evident from a 
plot of the series. Determining the actual form of non-stationarity, however, 
is not so easy from just a visual inspection, and an examination of the SACFs 
for various differences may be required. 

To see why this may be so, recall that a stationary AR($p$) process requires
that all roots $g_i$ in

$$\phi(B) = (1 - g_1B)(1 - g_2B)\cdots(1 - g_pB)$$

are such that $|g_i| < 1$. Now suppose that one of them, $g_1$ say, approaches
one, i.e. $g_1 = 1 - \delta$, where $\delta$ is a small positive number. Recalling the solution
to the difference equation (2.6), the autocorrelations will then be dominated
by $A_1g_1^k$, since

$$\rho_k = A_1g_1^k + A_2g_2^k + \cdots + A_pg_p^k \cong A_1g_1^k$$

as all other terms will go to zero more rapidly. Furthermore, as $g_1$ is close to
one, the exponential decay $A_1g_1^k$ will be slow and almost linear, since

$$A_1g_1^k = A_1(1 - \delta)^k = A_1(1 - \delta k + \delta^2k^2 - \cdots) \cong A_1(1 - \delta k)$$

The failure of the SACF to die down quickly is therefore an indication
of non-stationarity, its behaviour tending to be that of a slow, linear decline.
If the original series $x_t$ is found to be non-stationary, the first difference $\Delta x_t$
is then analysed. If $\Delta x_t$ is still non-stationary, the next difference $\Delta^2x_t$ is
analysed, the procedure being repeated until a stationary difference is found,
although it is seldom the case in practice that $d$ exceeds two.

Sole reliance on the SACF can sometimes lead to problems of overdifferencing.
Although further differences of a stationary series will themselves
be stationary, overdifferencing can lead to serious difficulties. Consider the
stationary MA(1) process $x_t = (1 - \theta B)a_t$. The first difference of this is

$$\Delta x_t = (1 - B)(1 - \theta B)a_t = (1 - (1 + \theta)B + \theta B^2)a_t = (1 - \theta_1B - \theta_2B^2)a_t$$

We now have a more complicated model containing two parameters rather
than one, and, moreover, one of the roots of the $\theta(B)$ polynomial is unity since
$\theta_1 + \theta_2 = 1$. The model is therefore not invertible, so that the AR($\infty$)
representation does not exist, and attempts to estimate this model will almost
inevitably run into difficulties.

Note also that the variance of $x_t$ is given by

$$V(x) = \gamma_0(x) = (1 + \theta^2)\sigma^2$$

whereas the variance of $w_t = \Delta x_t$ is given by

$$V(w) = \gamma_0(w) = (1 + (1 + \theta)^2 + \theta^2)\sigma^2 = 2(1 + \theta + \theta^2)\sigma^2$$

Hence,

$$V(w) - V(x) = (1 + \theta)^2\sigma^2 > 0$$

thus showing that the variance of the overdifferenced process will be larger
than that of the original MA(1) process. The behaviour of the sample variances
associated with different values of $d$ can provide a useful means of deciding
the appropriate level of differencing: the sample variances will decrease until
a stationary sequence has been found, but will tend to increase on overdifferencing.
This will not always be the case, however, and a comparison of
sample variances for successive differences of a series is best employed as a
useful auxiliary method for determining the appropriate value of d.
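
The variance comparison can be illustrated numerically. The sketch below evaluates the two theoretical variances for an MA(1) with parameter $\theta$ and compares sample variances from one simulated series. The simulation settings ($\theta = 0.5$, 5,000 observations, a fixed seed) are arbitrary assumptions made for the illustration.

```python
import random


def ma1_variance(theta, sigma2=1.0):
    """V(x) = (1 + theta^2) * sigma^2 for x_t = (1 - theta B) a_t."""
    return (1.0 + theta ** 2) * sigma2


def overdifferenced_variance(theta, sigma2=1.0):
    """V(w) for w_t = (1 - theta_1 B - theta_2 B^2) a_t, theta_1 = 1 + theta, theta_2 = -theta."""
    theta1, theta2 = 1.0 + theta, -theta
    return (1.0 + theta1 ** 2 + theta2 ** 2) * sigma2


def simulate_ma1(theta, n, seed=0):
    """Simulate n observations of x_t = a_t - theta * a_{t-1} with N(0,1) shocks."""
    rng = random.Random(seed)
    shocks = [rng.gauss(0.0, 1.0) for _ in range(n + 1)]
    return [shocks[t] - theta * shocks[t - 1] for t in range(1, n + 1)]


def sample_variance(series):
    mean = sum(series) / len(series)
    return sum((v - mean) ** 2 for v in series) / (len(series) - 1)


if __name__ == "__main__":
    theta = 0.5
    x = simulate_ma1(theta, 5000)
    w = [x[t] - x[t - 1] for t in range(1, len(x))]
    print("theory:", ma1_variance(theta), overdifferenced_variance(theta))
    print("sample:", round(sample_variance(x), 2), round(sample_variance(w), 2))
```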


3.2 Testing for a unit root 


3.2.1 An introduction to unit root tests 


Given the importance of choosing the correct order of differencing, it is clear
that we require a formal testing procedure to determine $d$. To introduce the
issues involved in developing such a procedure, we begin by considering the
simplest case, that of the zero-mean AR(1) process with normal innovations:

$$x_t = \phi x_{t-1} + a_t, \qquad t = 1,2,\ldots,T \qquad (3.1)$$

where $a_t \sim \mathrm{NID}(0,\sigma^2)$ and $x_0 = 0$. The OLS estimate of $\phi$ is given by

$$\hat\phi_T = \frac{\sum_{t=1}^{T} x_{t-1}x_t}{\sum_{t=1}^{T} x_{t-1}^2}$$

and, from the algebra of OLS, we have

$$\hat\phi_T - \phi = \frac{\sum_{t=1}^{T} x_{t-1}a_t}{\sum_{t=1}^{T} x_{t-1}^2}$$


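If the true value of $\phi$ is less than one in absolute value, then, from Hamilton
(1994, p. 216),

$$\sqrt{T}(\hat\phi_T - \phi) \overset{a}{\sim} N\left(0, \sigma^2\left[E(x_{t-1}^2)\right]^{-1}\right)$$

Since

$$E(x_{t-1}^2) = E\left(\sum_{i=0}^{\infty}\phi^ia_{t-1-i}\right)^2 = \sigma^2/(1 - \phi^2)$$

it follows that

$$\sqrt{T}(\hat\phi_T - \phi) \overset{a}{\sim} N(0, 1 - \phi^2)$$

from which hypothesis tests concerning $\phi$ may be constructed.
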

When $\phi = 1$ there is an immediate problem with this result, however, for it
seems to imply that $\sqrt{T}(\hat\phi_T - \phi)$ has a zero variance, which is not very
helpful for hypothesis testing! To obtain a non-degenerate asymptotic distribution
for $\hat\phi_T$ when $\phi = 1$, we need to scale $(\hat\phi_T - 1)$ by $T$ rather than $\sqrt{T}$.
To see why this is so, we need to investigate the distributional properties of
the two sums making up the ratio

$$\hat\phi_T - 1 = \frac{\sum_{t=1}^{T} x_{t-1}a_t}{\sum_{t=1}^{T} x_{t-1}^2}$$

When $\phi = 1$, (3.1) is the random walk

$$x_t = \sum_{s=1}^{t} a_s$$

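from which it follows that $x_t \sim N(0, \sigma^2t)$. Note also that

$$x_t^2 = (x_{t-1} + a_t)^2 = x_{t-1}^2 + 2x_{t-1}a_t + a_t^2$$

implying that

$$x_{t-1}a_t = (1/2)(x_t^2 - x_{t-1}^2 - a_t^2)$$

Thus, the numerator of the ratio can be written as

$$\sum_{t=1}^{T} x_{t-1}a_t = (1/2)(x_T^2 - x_0^2) - (1/2)\sum_{t=1}^{T} a_t^2$$

Recalling that $x_0 = 0$, we then have

$$\frac{1}{\sigma^2T}\sum_{t=1}^{T} x_{t-1}a_t = \frac{1}{2}\left(\frac{x_T}{\sigma\sqrt{T}}\right)^2 - \frac{1}{2\sigma^2}\left(\frac{1}{T}\sum_{t=1}^{T} a_t^2\right)$$

$x_T/(\sigma\sqrt{T})$ is $N(0,1)$, so its square is $\chi_1^2$, and $T^{-1}\sum_{t=1}^{T} a_t^2$ converges in
probability to $\sigma^2$. Thus,

$$T^{-1}\sum_{t=1}^{T} x_{t-1}a_t \Rightarrow (1/2)\sigma^2(X - 1)$$

where $X \sim \chi_1^2$.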
Since $E(x_t^2) = \sigma^2t$, it follows that the expectation of the denominator of
the ratio is

$$E\left[\sum_{t=1}^{T} x_{t-1}^2\right] = \sigma^2\sum_{t=1}^{T}(t - 1) = \sigma^2(T - 1)T/2$$

which has to be scaled by $T^{-2}$ in order to converge to a finite value:

$$E\left[T^{-2}\sum_{t=1}^{T} x_{t-1}^2\right] = (\sigma^2/2)(1 - 1/T) \to \sigma^2/2$$

as $T \to \infty$. Hence,

$$T(\hat\phi_T - 1) = \frac{T^{-1}\sum_{t=1}^{T} x_{t-1}a_t}{T^{-2}\sum_{t=1}^{T} x_{t-1}^2} \qquad (3.2)$$

has an asymptotic distribution that is a ratio of a (scaled) $\chi_1^2$ variable to a
non-standard distribution. But what is this distribution? To answer this
question, let us consider the limiting distribution of the standardised variable 
$x_t/\sigma\sqrt{T}$. For this, it is convenient to map the increasing interval from 0 to $T$
into the fixed interval [0,1] so that results will be invariant to the actual value
of $T$. We thus define the random step function $R_T(r)$ as follows. Denote $[rT]$
as the integer part of $rT$, where $r \in [0,1]$, and define

$$R_T(r) = x_{[rT]}(r)/\sigma\sqrt{T} = x_{t-1}/\sigma\sqrt{T}$$

In effect, the interval [0,1] is divided into $T + 1$ parts at $r = 0, 1/T, 2/T, \ldots, 1$
and $R_T(r)$ is constant at values of $r$ but with jumps at successive integers. As
$T \to \infty$, $R_T(r)$ becomes increasingly ‘dense’ on [0,1]. In the limit, $R_T(r)$
weakly converges to standard Brownian motion (or the Wiener process), $W(r)$,
denoted

$$R_T(r) \Rightarrow W(r) \sim N(0, r)$$


Hamilton (1994, chap. 17), Banerjee et al. (1993), and Davidson (2006b)
provide detailed treatments of this result, which is known as Donsker’s theorem
but often referred to as the functional central limit theorem. Three
implications of the theorem are

$$W(1) \sim N(0, 1), \qquad \sigma W(r) \sim N(0, \sigma^2r), \qquad [W(r)]^2/r \sim \chi_1^2$$


A further implication is that, if $f(\cdot)$ is a continuous functional on [0,1], then,
through the continuous mapping theorem,

$$f(R_T(r)) \Rightarrow f(W(r))$$


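We are now in a position to derive the asymptotic distribution of the
denominator of (3.2) (see Phillips, 1987a, theorem 3.1):

$$T^{-2}\sum_{t=1}^{T} x_{t-1}^2 = \sigma^2T^{-1}\sum_{t=1}^{T}\left(\frac{x_{t-1}}{\sigma\sqrt{T}}\right)^2 = \sigma^2\sum_{t=1}^{T}\int_{(t-1)/T}^{t/T}[R_T(r)]^2dr = \sigma^2\int_0^1[R_T(r)]^2dr \Rightarrow \sigma^2\int_0^1[W(r)]^2dr$$

Note also that, because $[W(1)]^2$ is distributed as $\chi_1^2$, the numerator of (3.2)
can be written as

$$T^{-1}\sum_{t=1}^{T} x_{t-1}a_t \Rightarrow (1/2)\sigma^2([W(1)]^2 - 1)$$

Hence,

$$T(\hat\phi_T - 1) \Rightarrow \frac{(1/2)([W(1)]^2 - 1)}{\int_0^1[W(r)]^2dr} \qquad (3.3)$$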


The denominator of (3.3) must be positive. Since $[W(1)]^2$ is $\chi_1^2$ and the
probability that a $\chi_1^2$ variable is less than unity is 0.68, the probability that
$T(\hat\phi_T - 1)$ is negative approaches 0.68 as $T$ becomes large. Thus, in two-thirds
of samples generated by a random walk, the estimate $\hat\phi_T$ will be less than unity.
Furthermore, when $[W(1)]^2$ is large so will be the denominator of (3.3), implying
that the limiting distribution of $T(\hat\phi_T - 1)$ will be skewed to the left, with
negative values twice as likely as positive values. Note that, from Phillips (1987a),

$$(1/2)([W(1)]^2 - 1) = \int_0^1 W(r)dW(r)$$

so an equivalent expression is

$$T(\hat\phi_T - 1) \Rightarrow \frac{\int_0^1 W(r)dW(r)}{\int_0^1[W(r)]^2dr}$$
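
The figure of 0.68 used above is just the probability that a standard normal variable lies within one unit of zero, since a $\chi_1^2$ variable is the square of a standard normal. A minimal check, using only the error function from the Python standard library (the function name is an illustrative choice):

```python
import math


def chi2_1_cdf(x):
    """P(X <= x) for X ~ chi-squared with one degree of freedom.

    Uses P(Z^2 <= x) = P(|Z| <= sqrt(x)) = erf(sqrt(x / 2)) for Z ~ N(0,1).
    """
    if x <= 0.0:
        return 0.0
    return math.erf(math.sqrt(x / 2.0))


if __name__ == "__main__":
    print(round(chi2_1_cdf(1.0), 4))
```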


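A conventional way of testing the null hypothesis $\phi = 1$ is to construct the
$t$-statistic

$$t_\phi = \frac{\hat\phi_T - 1}{\hat\sigma_{\hat\phi_T}} = \frac{\hat\phi_T - 1}{\left(s_T^2\Big/\sum_{t=1}^{T} x_{t-1}^2\right)^{1/2}} \qquad (3.4)$$

where

$$\hat\sigma_{\hat\phi_T} = \left(s_T^2\Big/\sum_{t=1}^{T} x_{t-1}^2\right)^{1/2}$$

is the usual OLS standard error for $\hat\phi_T$ and $s_T^2$ is the OLS estimate of $\sigma^2$:

$$s_T^2 = \sum_{t=1}^{T}(x_t - \hat\phi_Tx_{t-1})^2/(T - 1)$$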
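The statistic $t_\phi$ does not have a limiting normal distribution when
$\phi = 1$. To find the appropriate limiting distribution, rewrite (3.4) as

$$\tau = (T/s_T)(\hat\phi_T - 1)\left(T^{-2}\sum_{t=1}^{T} x_{t-1}^2\right)^{1/2}$$

where, following Fuller (1996), we denote the statistic as $\tau$ to distinguish it
from the conventional $t$-statistic. Substituting from (3.2) yields

$$\tau = \frac{T^{-1}\sum_{t=1}^{T} x_{t-1}a_t}{s_T\left(T^{-2}\sum_{t=1}^{T} x_{t-1}^2\right)^{1/2}}$$

Since $s_T^2$ is a consistent estimator of $\sigma^2$, it then follows from our previous
results that

$$\tau \Rightarrow \frac{(1/2)\sigma^2([W(1)]^2 - 1)}{\sigma\left(\sigma^2\int_0^1[W(r)]^2dr\right)^{1/2}} = \frac{(1/2)([W(1)]^2 - 1)}{\left(\int_0^1[W(r)]^2dr\right)^{1/2}} = \frac{\int_0^1 W(r)dW(r)}{\left(\int_0^1[W(r)]^2dr\right)^{1/2}} \qquad (3.5)$$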


An alternative test of the null hypothesis results from the fact that $\hat\phi_T$ is a
superconsistent estimate of $\phi = 1$. Dividing (3.2) by $\sqrt{T}$ yields

$$\sqrt{T}(\hat\phi_T - 1) = \frac{T^{-3/2}\sum_{t=1}^{T} x_{t-1}a_t}{T^{-2}\sum_{t=1}^{T} x_{t-1}^2}$$

The numerator converges to $(1/2)T^{-1/2}\sigma^2(X - 1)$. Since $X$, being $\chi_1^2$, has a
variance of two, the variance of this numerator is of order $T^{-1}$, so that the
numerator converges in probability to zero. Hence,

$$\sqrt{T}(\hat\phi_T - 1) \xrightarrow{p} 0$$

where $\xrightarrow{p}$ denotes convergence in probability (for a formal definition, see
Hamilton, 1994, chap. 7.1). This result allows $T(\hat\phi_T - 1)$, which has the
limiting distribution given by (3.3), to be used on its own as a test statistic
without needing to calculate its standard error. These tests, particularly $\tau$, are
known as Dickey-Fuller (DF) tests, after the original analyses in Fuller (1976)
and Dickey and Fuller (1979).

To make hypothesis testing operational, the limiting distributions (3.3)
and (3.5) have to be tabulated, and critical values computed. This is typically
done by Monte Carlo simulation (see Hendry, 1995, chap. 3.6, for a discussion
of Monte Carlo techniques). To tabulate the limiting distribution
(3.3), for example, $a_t$ could be simulated by drawing $T$ pseudo-random
$N(0,1)$ variates and calculating

$$\frac{T\sum_{t=1}^{T}\left(\sum_{s=1}^{t-1} a_s\right)a_t}{\sum_{t=1}^{T}\left(\sum_{s=1}^{t-1} a_s\right)^2}$$

Repeating this calculation $n$ times and compiling the results into an
empirical probability distribution will yield a close approximation to the
limiting distribution. Simulated limiting distributions of $T(\hat\phi_T - 1)$ and $\tau$
are shown in figures 3.1 and 3.2, using $\sigma^2 = 1$, $T = 1000$ and $n = 25,000$.
These distributions are computed as smoothed functions of the histogram
constructed from the simulated series using a normal kernel. Silverman
(1986) provides an excellent reference to this technique, which is available in
several econometric software packages. The parameter that controls the
amount of smoothing that is carried out was chosen to show the limiting
distributions to be as smooth as possible whilst retaining their distinctive
shapes relative to the standard normal.
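
The Monte Carlo procedure just described can be sketched in a few lines. This is an illustrative pure-Python version rather than the code behind the figures: the function names, the seed and the default replication settings are assumptions, and the kernel smoothing step is omitted. The settings quoted in the text ($T = 1000$, $n = 25,000$) would run, but slowly, in an interpreter loop of this kind.

```python
import math
import random


def simulate_df_statistics(T, n, seed=0):
    """Simulate n driftless random walks of length T with N(0,1) innovations.

    Returns a list of (T * (phi_hat - 1), tau) pairs from the zero-mean AR(1)
    regression of x_t on x_{t-1}, with x_0 = 0.
    """
    rng = random.Random(seed)
    results = []
    for _ in range(n):
        x_prev = 0.0
        sum_xa = sum_xx = sum_aa = 0.0
        for _ in range(T):
            a = rng.gauss(0.0, 1.0)
            sum_xa += x_prev * a       # accumulates sum of x_{t-1} a_t
            sum_xx += x_prev * x_prev  # accumulates sum of x_{t-1}^2
            sum_aa += a * a
            x_prev += a
        phi_minus_1 = sum_xa / sum_xx
        # Residual sum of squares of the OLS fit, divided by T - 1.
        s2 = (sum_aa - sum_xa ** 2 / sum_xx) / (T - 1)
        tau = phi_minus_1 * math.sqrt(sum_xx / s2)
        results.append((T * phi_minus_1, tau))
    return results


def empirical_quantile(values, prob):
    """Crude empirical quantile: the order statistic at position prob * len."""
    ordered = sorted(values)
    index = min(len(ordered) - 1, max(0, int(prob * len(ordered))))
    return ordered[index]


if __name__ == "__main__":
    draws = simulate_df_statistics(T=200, n=5000)
    normalised = [d[0] for d in draws]
    taus = [d[1] for d in draws]
    print("share negative:", sum(v < 0 for v in normalised) / len(normalised))
    print("5% quantile of T(phi_hat - 1):", round(empirical_quantile(normalised, 0.05), 2))
    print("5% quantile of tau:", round(empirical_quantile(taus, 0.05), 2))
```

With settings of this size the simulated quantiles should lie reasonably close to tabulated critical values, subject to sampling error and to the finite value of $T$.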

Figure 3.1 shows clearly the skewed limiting distribution of $T(\hat\phi_T - 1)$.
Critical values are available from many sources (see, for example, Hamilton,
1994, table B.5 case 1, and Banerjee et al., 1993, table 4.1a, as well as the
original source: Fuller, 1976, p. 371). For example, for $T$ as large as it is in
these simulations, the 5 per cent, 2.5 per cent and 1 per cent critical values
are $-8.1$, $-10.5$ and $-13.8$, respectively. Figure 3.2 shows that the limiting
distribution of $\tau$ is approximately standard normal but shifted to the left by
roughly 0.3: the large $T$ 5 per cent, 2.5 per cent and 1 per cent critical values
for $\tau$ are $-1.95$, $-2.23$ and $-2.58$, rather than the $N(0,1)$ critical values
of $-1.65$, $-1.96$ and $-2.33$.

Figure 3.1 Simulated limiting distribution of $T(\hat\phi_T - 1)$

More extensive critical values than those given in standard tables are 
available in most econometric packages. These are obtained using the 
response surfaces computed by MacKinnon (1991). For example, 1 per cent
critical values of $\tau$ for a given sample size $T$ can be calculated from

$$\tau_{.01}(T) = -2.5658 - 1.960T^{-1} - 10.04T^{-2}$$


and response surfaces for 5 per cent and 10 per cent critical values are 
reported in MacKinnon (1991, table 1) and repeated in Banerjee et al. (1993, 
table 7.2). Much more extensive simulations carried out in MacKinnon 
(1996) also allow p-values (i.e. marginal significance levels) to be calculated 
for a wide range of sample sizes. 
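
A response surface of this form is simple to evaluate. The sketch below uses only the 1 per cent coefficients quoted above for $\tau$; the function name is illustrative, and coefficients for other significance levels would have to be taken from the cited tables.

```python
def response_surface_critical_value(T, b_inf, b_1, b_2):
    """Evaluate b_inf + b_1 / T + b_2 / T^2 for a sample of size T."""
    return b_inf + b_1 / T + b_2 / T ** 2


# 1 per cent coefficients for tau (no constant, no trend) quoted in the text.
TAU_01 = (-2.5658, -1.960, -10.04)

if __name__ == "__main__":
    for T in (25, 50, 100, 500, 1000):
        print(T, round(response_surface_critical_value(T, *TAU_01), 4))
```

As $T$ grows the value approaches the asymptotic term $-2.5658$, which rounds to the large-sample 1 per cent critical value of $-2.58$ used in the text.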


3.2.2 Extensions to the Dickey-Fuller test


The case discussed in the previous section has the merit of being simple but is
not particularly realistic, for it implies that the alternative to a driftless
random walk is a stationary AR(1) process about a zero mean. A more
sensible alternative would be for the AR(1) process to fluctuate about a
non-zero mean, i.e. that we have the model

$$x_t = \theta_0 + \phi x_{t-1} + a_t, \qquad t = 1,2,\ldots,T \qquad (3.6)$$

in which the unit root null is parameterised as $\theta_0 = 0$, $\phi = 1$. The presence of
an intercept in (3.6) alters the distribution of the test statistics: rather than
(3.3) and (3.5), we have (see, for example, Hamilton, 1994, chap. 17.4)

$$T(\hat\phi_T - 1) \Rightarrow \frac{(1/2)([W(1)]^2 - 1) - W(1)\int_0^1 W(r)dr}{\int_0^1[W(r)]^2dr - \left(\int_0^1 W(r)dr\right)^2} \qquad (3.7)$$

and

$$\tau_\mu \Rightarrow \frac{(1/2)([W(1)]^2 - 1) - W(1)\int_0^1 W(r)dr}{\left\{\int_0^1[W(r)]^2dr - \left(\int_0^1 W(r)dr\right)^2\right\}^{1/2}}$$

This statistic is denoted $\tau_\mu$ to emphasise that a non-zero mean is allowed
for in the regression (3.6). Figure 3.3 presents the simulated distribution of
$\tau_\mu$ using the same settings of $\sigma^2$, $T$ and $n$ as before, and with a standard
normal superimposed for comparison. With the non-zero mean, the distribution
under the unit root null deviates further from a standard normal
than when the mean is zero (compare with figure 3.2). The large $T$ 5 per cent,
2.5 per cent and 1 per cent critical values are now $-2.86$, $-3.12$ and $-3.43$,
and again critical values for other sample sizes can be obtained from response
surfaces, e.g.

$$\tau_{\mu,.01}(T) = -3.4335 - 5.999T^{-1} - 29.25T^{-2}$$


These statistics actually concentrate on testing the null hypothesis $\phi = 1$,
conditional upon the maintained assumption that $\theta_0 = 0$. It might seem more
natural to test the joint hypothesis $\theta_0 = 0$, $\phi = 1$, which can be done by
constructing, for example, a conventional Wald test. Under this null the
model is the driftless random walk, so that $\sum_{t=1}^{T}\Delta x_t^2$ can be regarded as the
restricted residual sum of squares, which should be compared to the
unrestricted residual sum of squares

$$\sum_{t=1}^{T}\hat a_t^2 = \sum_{t=1}^{T}(x_t - \hat\theta_0 - \hat\phi_Tx_{t-1})^2$$

by constructing the statistic

$$\Phi = \frac{\left(\sum_{t=1}^{T}\Delta x_t^2 - \sum_{t=1}^{T}\hat a_t^2\right)/2}{\sum_{t=1}^{T}\hat a_t^2/(T - 2)}$$

Rather than being distributed as $F(2, T - 2)$, however, the limiting distribution
of $\Phi$ is tabulated in Dickey and Fuller (1981) (again, also in Hamilton, 1994,
table B.7 case 2, and Banerjee et al., 1993, table 4.5(a)). For example, for large
$T$, 5 per cent and 1 per cent critical values of $\Phi$ are 4.59 and 6.43, rather than
2.99 and 4.60.
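
Both $\tau_\mu$ and $\Phi$ can be computed from the simple regression (3.6) using closed-form OLS. The sketch below is illustrative: the function name is hypothetical, $T$ is taken to be the number of usable observations, and the standard error uses a residual variance with $T - 2$ degrees of freedom, which is an assumption consistent with the denominator of $\Phi$ rather than something stated in the text.

```python
import math


def df_intercept_statistics(x):
    """Return (tau_mu, Phi) from regressing x_t on a constant and x_{t-1}."""
    y, z = x[1:], x[:-1]
    T = len(y)
    z_bar, y_bar = sum(z) / T, sum(y) / T
    s_zz = sum((zi - z_bar) ** 2 for zi in z)
    s_zy = sum((zi - z_bar) * (yi - y_bar) for zi, yi in zip(z, y))
    phi_hat = s_zy / s_zz
    theta0_hat = y_bar - phi_hat * z_bar
    # Unrestricted and restricted (driftless random walk) sums of squares.
    rss_u = sum((yi - theta0_hat - phi_hat * zi) ** 2 for zi, yi in zip(z, y))
    rss_r = sum((yi - zi) ** 2 for zi, yi in zip(z, y))
    s2 = rss_u / (T - 2)
    tau_mu = (phi_hat - 1.0) / math.sqrt(s2 / s_zz)
    phi_stat = ((rss_r - rss_u) / 2.0) / s2
    return tau_mu, phi_stat


if __name__ == "__main__":
    import random

    rng = random.Random(0)
    walk = [0.0]
    for _ in range(500):
        walk.append(walk[-1] + rng.gauss(0.0, 1.0))
    print(df_intercept_statistics(walk))
```

The resulting values would be compared with the Dickey-Fuller critical values quoted above, not with normal or $F$ tables.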

All the simulated distributions have been computed using normal
innovations. If the innovations are not normal but have finite variance, the
distributions are valid as long as $T$ is large, which will typically be the case
with financial applications (the infinite variance case is considered briefly in
section 7 of this chapter). A more important generalisation is to allow the
innovations to be serially correlated. Suppose that $x_t$ is generated by the
AR($p$) process

$$(1 - \phi_1B - \phi_2B^2 - \cdots - \phi_pB^p)x_t = \theta_0 + a_t$$
or

$$x_t = \theta_0 + \sum_{i=1}^{p}\phi_ix_{t-i} + a_t \qquad (3.8)$$

A more convenient representation is obtained by defining

$$\phi = \sum_{i=1}^{p}\phi_i$$

$$\delta_i = -\sum_{j=i+1}^{p}\phi_j, \qquad i = 1,2,\ldots,p-1$$

so that (3.8) can be written, with $k = p - 1$,

$$x_t = \theta_0 + \phi x_{t-1} + \sum_{i=1}^{k}\delta_i\Delta x_{t-i} + a_t \qquad (3.9)$$


The null of one unit root is thus $\phi = \sum_{i=1}^{p}\phi_i = 1$. OLS provides consistent
estimates of (3.8), and a test of $\phi = 1$ can be constructed as

$$\tau_\mu = \frac{\hat\phi_T - 1}{se(\hat\phi_T)}$$

where $se(\hat\phi_T)$ is the OLS standard error attached to the estimate $\hat\phi_T$. This
statistic is also denoted $\tau_\mu$ because it has the same limiting distribution as the
statistic obtained from the AR(1) model (3.6), although it is often referred to
as the augmented Dickey-Fuller (ADF) test (Dickey and Fuller, 1979; see
Hamilton, 1994, chap. 17.7, for a detailed derivation). Similarly, the Wald $\Phi$
test has an identical distribution to that obtained in the AR(1) case. On
the other hand, it is $T(\hat\phi_T - 1)/(1 - \delta_1 - \cdots - \delta_k)$ that is distributed as
in (3.7).

The above analysis has implicitly assumed that the AR order $p$ is known, so
that we are certain that $x_t$ is generated by a $p$th-order autoregression. If the
generating process is an ARMA($p$,$q$), then Said and Dickey (1985) show that
the $\tau_\mu$ statistic obtained from estimating the model

$$x_t = \theta_0 + \phi x_{t-1} + \sum_{i=1}^{k}\delta_i\Delta x_{t-i} + a_t - \sum_{j=1}^{q}\theta_ja_{t-j}$$

has the same limiting distribution as that calculated from (3.9). The
problem here, of course, is that $p$ and $q$ are assumed known, and this is
unlikely to be the case in practice. When $p$ and $q$ are unknown, Said and
Dickey (1984) show that, under the null hypothesis of a unit root, the test
statistic obtained from (3.8) can still be used if $k$, the number of lags of $\Delta x_t$
introduced as regressors, increases with the sample size at a controlled rate
of $T^{1/3}$. With typical financial data, which do not contain pronounced
seasonal variation, the results of Schwert (1987) and Diebold and Nerlove
(1990) suggest that setting $k$ at $[T^{0.25}]$ should work well in practice, where
$[\cdot]$ again denotes the operation of taking the integer part of the argument.
This adjustment is necessary because, as the sample size increases, the effects
of the correlation structure of the residuals on the shape of the distribution
of $\tau_\mu$ become more precise. Any choice of $k$ will involve questions of test size
and power and of trade-offs between the two, however, and we return to
this issue in section 4.
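
To make the augmented regression concrete, the following sketch estimates (3.9) by OLS and returns $\tau_\mu$. It is a self-contained illustration rather than a production routine: the helper names are hypothetical, the normal equations are solved by a small Gauss-Jordan inversion, and the lag length in the usage example is taken as the integer part of $T^{0.25}$, which is the reading of the rule of thumb assumed here.

```python
import math


def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [vr - factor * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def adf_tau_mu(x, k):
    """tau_mu from regressing x_t on 1, x_{t-1} and k lagged differences."""
    dx = [x[t + 1] - x[t] for t in range(len(x) - 1)]  # dx[j] = x[j+1] - x[j]
    rows, y = [], []
    for t in range(k + 1, len(x)):
        lagged_diffs = [dx[t - i - 1] for i in range(1, k + 1)]
        rows.append([1.0, x[t - 1]] + lagged_diffs)
        y.append(x[t])
    m, n = len(rows[0]), len(rows)
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(m)] for i in range(m)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(m)]
    inv = invert(xtx)
    beta = [sum(inv[i][j] * xty[j] for j in range(m)) for i in range(m)]
    resid = [yi - sum(b * v for b, v in zip(beta, r)) for r, yi in zip(rows, y)]
    s2 = sum(e * e for e in resid) / (n - m)
    return (beta[1] - 1.0) / math.sqrt(s2 * inv[1][1])


if __name__ == "__main__":
    import random

    rng = random.Random(0)
    walk = [0.0]
    for _ in range(500):
        walk.append(walk[-1] + rng.gauss(0.0, 1.0))
    k = int(len(walk) ** 0.25)
    print(k, round(adf_tau_mu(walk, k), 3))
```

The statistic is compared with the $\tau_\mu$ critical values (for example $-2.86$ at the 5 per cent level for large $T$), not with normal critical values.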


3.2.3 Non-parametric tests for a unit root 


An alternative approach to dealing with autocorrelation in $a_t$, and which also
allows for heterogeneity of variance, has been proposed by Phillips (1987a,
1987b; see also Phillips and Perron, 1988). Rather than including extra lags of
$\Delta x_t$ to ensure that the errors in (3.8) are white noise, the idea here is to
modify the statistics after estimation of the simple model, (3.6) say, in
order to take into account the effects of autocorrelated errors and to enable
the same limiting distributions, and hence critical values, to apply.
Consider again the model

$$x_t = \theta_0 + \phi x_{t-1} + a_t, \qquad t = 1,2,\ldots,T \qquad (3.10)$$


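but we now place the following set of conditions on the stochastic process
$\{a_t\}$:

$$E(a_t) = 0 \text{ for all } t \qquad (3.11a)$$

$$\sup_t E(|a_t|^\beta) < \infty \text{ for some } \beta > 2 \qquad (3.11b)$$

$$\sigma_S^2 = \lim_{T\to\infty} E(T^{-1}S_T^2) \text{ exists and is positive, where } S_T = \sum_{t=1}^{T} a_t \qquad (3.11c)$$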


Condition (3.11b) is sufficient to ensure the existence of the variance and at
least one higher-order moment of $a_t$. Normality entails that all moments of
finite order exist, and, as we shall see in later chapters, the existence of fourth
moments is often required when dealing with financial time series. $E(|a_t|^2)$
is not assumed to be constant, however, so that heterogeneity is allowed.
Condition (3.11c) is needed to ensure non-degenerate limiting distributions,
while (3.11a) is the conventional one of ensuring that all conditional means
are constant, namely zero. A fourth condition is necessary, which requires
that

$$a_t \text{ is strong mixing, with mixing numbers } \alpha_m \text{ that satisfy } \sum_{m=1}^{\infty}\alpha_m^{1-2/\beta} < \infty \qquad (3.11d)$$


Strong mixing is related to ergodicity, which was introduced briefly in
chapter 2, and implies ergodicity if $a_t$ is stationary, which it need not be. The
mixing numbers $\alpha_m$ measure the strength and extent of temporal dependence
within the sequence $a_t$, and condition (3.11d) ensures that dependence
declines as the length of memory, represented by $m$, increases. Strong mixing
allows a considerable degree of serial dependence in $a_t$, but there is a trade-off
between the extent of such dependence and the presence of heterogeneity
(i.e. the probability of outliers), as is seen by the fact that the same coefficient
$\beta$ is present in both conditions (3.11b) and (3.11d). The overall set of
conditions (3.11), which are described in detail in Phillips (1987a), may be
characterised by the statement that $a_t$ is weakly dependent.

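If $a_t$ is stationary in (3.10), then

$$\sigma_S^2 = E(a_1^2) + 2\sum_{j=2}^{\infty} E(a_1a_j)$$

For example, if $a_t$ is the MA(1) process $a_t = \varepsilon_t - \theta\varepsilon_{t-1}$, where
$E(\varepsilon_t^2) = \sigma_\varepsilon^2$, then

$$\sigma_S^2 = \sigma_\varepsilon^2(1 + \theta^2) - 2\sigma_\varepsilon^2\theta = \sigma_\varepsilon^2(1 - \theta)^2$$

Only if $a_t$ is white noise will $\sigma_S^2$ equal $\sigma^2$, the variance of $a_t$. In the MA(1) case
$\sigma^2 = \sigma_\varepsilon^2(1 + \theta^2)$; in general it can be defined as

$$\sigma^2 = \lim_{T\to\infty} T^{-1}\sum_{t=1}^{T} E(a_t^2)$$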

It is this inequality that necessitates the ‘non-parametric’ corrections to
the DF statistics proposed by Phillips. For example, rather than $T(\hat{\phi}_T - 1)$,
an asymptotically valid test is

$$Z(\phi) = T(\hat{\phi}_T - 1) - (1/2)\left(\hat{\sigma}_{S\ell}^2 - \hat{\sigma}^2\right)\left[T^{-2}\sum_{t=2}^{T}(x_{t-1} - \bar{x}_{-1})^2\right]^{-1}$$

Here $\bar{x}_{-1} = (T - 1)^{-1}\sum_{t=1}^{T-1} x_t$, while $\hat{\sigma}_{S\ell}^2$ and $\hat{\sigma}^2$ are consistent estimates of
the variances $\sigma_S^2$ and $\sigma^2$, respectively. The latter is simply given by

$$\hat{\sigma}^2 = T^{-1}\sum_{t=1}^{T}\hat{a}_t^2$$

where the $\hat{a}_t$ are the residuals from estimating (3.10). The former is typically
calculated as

$$\hat{\sigma}_{S\ell}^2 = T^{-1}\sum_{t=1}^{T}\hat{a}_t^2 + 2T^{-1}\sum_{j=1}^{\ell}\omega_j(\ell)\sum_{t=j+1}^{T}\hat{a}_t\hat{a}_{t-j} \tag{3.12}$$


The window or kernel function $\omega_j(\ell)$ is typically assumed to satisfy the
conditions in Andrews (1991), which ensure that the estimate is positive, with the
added restriction that $\omega_j(\ell) = 0$ for $j > \ell$, so that $\ell$ acts as a truncation lag. A
popular choice is the triangular set of lag weights $\omega_j(\ell) = 1 - j/(\ell + 1)$ (Newey
and West, 1987), but other choices are available and, indeed, there are many
other estimators of $\sigma_S^2$ (see, for example, Andrews, 1991, and Phillips, 2005a).
Alternatively, $\tau_\mu$ can be adjusted to become

$$Z(\tau_\mu) = \tau_\mu\left(\hat{\sigma}/\hat{\sigma}_{S\ell}\right) - (1/2)\left(\hat{\sigma}_{S\ell}^2 - \hat{\sigma}^2\right)\left[\hat{\sigma}_{S\ell}^2\, T^{-2}\sum_{t=2}^{T}(x_{t-1} - \bar{x}_{-1})^2\right]^{-1/2}$$

Under the unit root null, these statistics have the same limiting distributions
as $T(\hat{\phi}_T - 1)$ and $\tau_\mu$, respectively, and hence the same sets of critical values
may be used. When $x_t$ has a zero mean the adjusted statistics are the same
as $Z(\phi)$ and $Z(\tau_\mu)$, but with $\bar{x}_{-1}$ removed; these have the same limiting
distributions as the zero-mean DF statistics.

For these non-parametric statistics to become operational, the lag truncation
parameter $\ell$ has to be set. Phillips (1987a) shows that $\ell$ has to increase
with $T$, but at a rate slower than $T^{0.25}$. This does not, however, tell us how to
set $\ell$ in practice, and no simple rule has emerged from the Monte Carlo
investigations of Phillips and Perron (1988), Schwert (1987), Kim and
Schmidt (1990) and Ng and Perron (1995, 2001). We use the $[T^{0.25}]$ rule in
the examples that follow.
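The estimator (3.12) and the adjusted statistic $Z(\tau_\mu)$ can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function names `bartlett_lrv` and `pp_z_tau` are hypothetical, NumPy is assumed to be available, the triangular weights $\omega_j(\ell) = 1 - j/(\ell + 1)$ are used, and the truncation lag is left as an argument for the user to set.

```python
import numpy as np

def bartlett_lrv(resid, ell):
    """Long-run variance in the form of (3.12) with triangular weights."""
    a = np.asarray(resid, dtype=float)
    T = a.size
    lrv = a @ a / T                        # sigma-hat^2, the j = 0 term
    for j in range(1, ell + 1):
        w = 1.0 - j / (ell + 1.0)          # omega_j(ell)
        lrv += 2.0 * w * (a[j:] @ a[:-j]) / T
    return lrv

def pp_z_tau(x, ell):
    """Return (tau_mu, Z(tau_mu)) from regressing x_t on a constant and x_{t-1}."""
    x = np.asarray(x, dtype=float)
    y, lag = x[1:], x[:-1]
    T = y.size
    X = np.column_stack([np.ones(T), lag])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    s2 = resid @ resid / (T - 2)           # residual variance for the standard error
    se_phi = np.sqrt(s2 * np.linalg.inv(X.T @ X)[1, 1])
    tau = (coef[1] - 1.0) / se_phi
    sig2 = resid @ resid / T               # sigma-hat^2
    lrv = bartlett_lrv(resid, ell)         # sigma-hat_{S ell}^2
    ssq = np.sum((lag - lag.mean()) ** 2) / T**2
    z = tau * np.sqrt(sig2 / lrv) - 0.5 * (lrv - sig2) / np.sqrt(lrv * ssq)
    return tau, z
```

With `ell = 0` the two variance estimates coincide, so the correction vanishes and `pp_z_tau` returns the unadjusted $\tau_\mu$ twice; this is a convenient sanity check before trying larger truncation lags.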

As has been discussed earlier, the presence of a unit root is often a
theoretical implication of models that postulate the rational use of information
available to economic agents, and thus unit roots occur in many theoretical
financial models. For example, variables such as futures contracts, stock prices,
dividends and earnings, spot and exchange rates, and interest rates should
all contain unit roots under rational expectations. Unit root tests are thus
extremely important in the analysis of financial time series.


Example 3.1 Unit root tests on financial time series
Examples 2.2 and 2.4 examined two models for the UK interest rate spread, a
stationary AR(2) process and an I(1) process without drift, while example 2.9
compared and contrasted the two models. We are now in a position to
discriminate between the two through an application of a unit root test. The
fitted AR(2) model

$$x_t = \underset{(0.019)}{0.032} + \underset{(0.039)}{1.180}\,x_{t-1} - \underset{(0.039)}{0.216}\,x_{t-2} + a_t$$

can be equivalently estimated as

$$x_t = \underset{(0.019)}{0.032} + \underset{(0.009)}{0.965}\,x_{t-1} - \underset{(0.039)}{0.216}\,\Delta x_{t-1} + a_t$$

so that, with $T = 646$, $T(\hat{\phi}_T - 1)/(1 - \hat{\delta}_1) = 646(0.965 - 1)/(1 + 0.216) = -18.6$,
which is significant at the 2.5 per cent level. Alternatively,
$\tau_\mu = (0.965 - 1)/0.009 = -3.89$, which is significant at the 1 per cent level,
this critical value being $-3.44$, as obtained from the MacKinnon response
surface. Note that the $\tau_\mu$ statistic can be obtained directly as the t-ratio on
$x_{t-1}$ from rewriting the model again as

$$\Delta x_t = \underset{(0.019)}{0.032} - \underset{(0.009)}{0.035}\,x_{t-1} - \underset{(0.039)}{0.216}\,\Delta x_{t-1} + a_t$$

Regressions such as these are often referred to as Dickey-Fuller regressions.
The non-parametric $\tau_\mu$ statistic, computed with $\ell = 6$, is $Z(\tau_\mu) = -3.69$, and
thus confirms the rejection of a unit root at the 1 per cent significance level,
while the joint test of $\theta_0 = 0$, $\phi = 1$, yields a statistic of $\Phi = 7.39$, which is
significant at the 2.5 per cent level. We can therefore conclude that the
appropriate model for the spread is a stationary AR(2) process. We should
also note that the estimate of the innovation standard deviation $\sigma$ under the
unit root null is 0.423, higher than its estimate under the AR(2) model
(0.419), both of which are reported in example 2.7. This is in accordance
with our earlier discussion of overdifferencing.
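The arithmetic behind the reported statistics is easy to reproduce. The sketch below uses a hypothetical helper, `df_statistics`, that takes the estimated coefficient on $x_{t-1}$, its standard error and the lag coefficients on the differenced terms; the numerical inputs are simply the rounded estimates quoted in this example, so the results agree with the text only up to rounding.

```python
def df_statistics(T, phi_hat, se_phi, deltas=()):
    """Normalised-bias and t-ratio forms of the Dickey-Fuller statistic.

    deltas holds the estimated coefficients on the lagged differences
    (empty when the regression has no lag augmentation).
    """
    norm_bias = T * (phi_hat - 1.0) / (1.0 - sum(deltas))
    tau = (phi_hat - 1.0) / se_phi
    return norm_bias, tau

# UK interest rate spread: T = 646, phi-hat = 0.965 (s.e. 0.009), delta-hat_1 = -0.216
spread_bias, spread_tau = df_statistics(646, 0.965, 0.009, (-0.216,))
print(round(spread_bias, 1), round(spread_tau, 2))   # -18.6 -3.89

# Dollar/sterling rate: T = 3391, phi-hat = 1 - 0.00220 (s.e. 0.00115), no lags
fx_bias, fx_tau = df_statistics(3391, 1.0 - 0.00220, 0.00115)
print(round(fx_bias, 1), round(fx_tau, 2))           # -7.5 -1.91
```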



A similar approach to testing for a unit root in the dollar/sterling exchange
rate, the presence of which was assumed in example 2.5, leads to the
estimated equation

$$\Delta x_t = \underset{(0.0019)}{0.0036} - \underset{(0.00115)}{0.00220}\,x_{t-1}$$

Here, with $T = 3391$, we have $T(\hat{\phi}_T - 1) = -7.5$, $\tau_\mu = -1.91$, $\Phi = 1.91$ and,
for $\ell = 8$, $Z(\tau_\mu) = -1.91$. All are clearly insignificant, thus confirming that
the appropriate model is indeed a random walk.

Figure 3.4 plots the dividend yield (D/P) of the UK All Share index for the
period January 1965 to December 2005. Recall that in example 2.6 we stated
that it appeared to be generated by an ARMA(1,3) process. Although the
series does not contain a trend, its wandering pattern could be a consequence
of its being generated by an I(1) process, and hence a unit root test may
be performed. Since we are by no means certain that the ARMA orders are
correct, it would seem appropriate to compute an ADF test with the lag
augmentation order $k$ chosen using the $[T^{0.25}]$ rule. Since $T = 491$, this
sets $k = 4$, from which we obtain $\tau_\mu = -3.30$, which is significant at the
2.5 per cent level. The non-parametric version confirms this, for, with $\ell = 5$,
$Z(\tau_\mu) = -3.19$, which again is significant at the 2.5 per cent level. The joint
test statistic is $\Phi = 5.49$, also significant at the 2.5 per cent level.
Although $\phi$ is estimated to be 0.959, its standard error, 0.012, is sufficiently
small for us to reject the null of a unit root in favour of the alternative that
the dividend yield is stationary, a finding that will be shown in chapter 10 to
be consistent with the implications of one of the most fundamental models
in finance: that of the present value relationship linking real stock prices and
dividends.


3.3 Trend stationarity versus difference stationarity 


In the unit root testing strategy outlined above, the implicit null hypothesis is
that the series is generated as a driftless random walk with, possibly, a serially
correlated error. In the terminology of Nelson and Plosser (1982), $x_t$ is said to
be difference stationary (DS)

$$\Delta x_t = \varepsilon_t \tag{3.13}$$

where $\varepsilon_t = \theta(B)a_t$, while the alternative is that $x_t$ is stationary in levels. While
the null of a driftless random walk is appropriate for many financial time
series, such as interest rates and exchange rates, other series often do contain
a drift, so that the relevant null becomes

$$\Delta x_t = \theta + \varepsilon_t \tag{3.14}$$

In this case, a plausible alternative is that $x_t$ is generated by a linear trend
buried in stationary noise (see chapter 2, section 6), i.e. it is trend stationary
(TS)

$$x_t = \beta_0 + \beta_1 t + \varepsilon_t \tag{3.15}$$


Perron (1988, theorem 1) shows that neither the $\tau_\mu$ statistic obtained from
(3.8) nor its non-parametric counterpart, $Z(\tau_\mu)$, is capable of
distinguishing a stationary process around a linear trend (model (3.15)) from a
process with a unit root and drift (model (3.14)). Indeed, rejection of a null
hypothesis of a unit root is unlikely if the series is stationary around a linear
trend, and becomes impossible as the sample size increases.

A test of (3.14) against (3.15) is, however, straightforward to carry out by 
using an extension of the testing methodology discussed above. If the 
parametric testing procedure is used, then (3.9) is extended by the inclusion 
of the time trend t as an additional regressor, 


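$$x_t = \beta_0 + \beta_1 t + \phi x_{t-1} + \sum_{i=1}^{k}\delta_i\Delta x_{t-i} + a_t \tag{3.16}$$

and the statistic

$$\tau_\tau = \frac{\hat{\phi}_T - 1}{se(\hat{\phi}_T)}$$

is constructed. This ‘t-statistic’ is denoted $\tau_\tau$ to distinguish it from $\tau_\mu$,
because it has a different limiting distribution, this time given by

$$\tau_\tau \Rightarrow \frac{(1/2)\left([W(1)]^2 - 1\right) - W(1)\int_0^1 W(r)dr + A}{\left\{\int_0^1 [W(r)]^2 dr - \left(\int_0^1 W(r)dr\right)^2 + B\right\}^{1/2}}$$

where

$$A = 12\left[\int_0^1 rW(r)dr - (1/2)\int_0^1 W(r)dr\right]\left[\int_0^1 W(r)dr - (1/2)W(1)\right]$$

and

$$B = 12\left[\int_0^1 W(r)dr\int_0^1 rW(r)dr - \left(\int_0^1 rW(r)dr\right)^2\right] - 3\left(\int_0^1 W(r)dr\right)^2$$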

It is perhaps more informative to observe the simulated limiting distribution
shown in figure 3.5, once again computed with $\sigma^2 = 1$, $T = 1000$ and
$n = 25,000$, and here with drift $\theta = 1$. The large $T$ 5 per cent, 2.5 per cent and
1 per cent critical values are now $-3.41$, $-3.66$ and $-3.96$, and, again, critical
values for other sample sizes can be obtained from the MacKinnon response
surfaces, e.g.

$$\tau_{\tau,.01}(T) = -3.9638 - 8.353T^{-1} - 47.44T^{-2}$$

Figure 3.5 Simulated limiting distribution of $\tau_\tau$
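As a small illustration, the 1 per cent response surface quoted above can be evaluated for any sample size. The function name `tau_tau_cv_1pct` is hypothetical; the three coefficients are those given in the text.

```python
def tau_tau_cv_1pct(T):
    """1 per cent critical value of tau_tau from the response surface in the text."""
    return -3.9638 - 8.353 / T - 47.44 / T**2

for T in (100, 488, 1000):
    print(T, round(tau_tau_cv_1pct(T), 3))
```

As $T$ grows the finite-sample terms vanish and the value approaches the large-sample figure of $-3.96$.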


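If the non-parametric approach is employed, then a time trend may be
added to (3.10) and the analogous adjusted t-ratio can again be compared to
the $\tau_\tau$ distribution. The adjustment in this case is

$$Z(\tau_\tau) = \tau_\tau\left(\hat{\sigma}/\hat{\sigma}_{S\ell}\right) - (1/2)\left(\hat{\sigma}_{S\ell}^2 - \hat{\sigma}^2\right)T^3\left[4\hat{\sigma}_{S\ell}(3D_x)^{1/2}\right]^{-1}$$

where

$$D_x = \left[T^2(T^2 - 1)/12\right]\sum x_{t-1}^2 - T\left(\sum t x_{t-1}\right)^2 + T(T + 1)\sum t x_{t-1}\sum x_{t-1} - \left[T(T + 1)(2T + 1)/6\right]\left(\sum x_{t-1}\right)^2$$

is the determinant of the regressor moment matrix of the time trend
augmented regression (3.10).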

Strictly, the unit root null requires not only that $\phi = 1$ in (3.16) but also
that $\beta_1 = 0$, because, if $\beta_1 \neq 0$, $x_t$ will contain a quadratic trend. This is easily
seen if we set $k = 0$ for simplicity, for then (3.16) can also be written as

$$x_t = \beta_0\sum_{j=1}^{t}\phi^{t-j} + \beta_1\sum_{j=1}^{t} j\phi^{t-j} + \sum_{j=1}^{t} a_j\phi^{t-j}$$

Under the null $\phi = 1$, this becomes

$$x_t = \beta_0 t + \beta_1\sum_{j=1}^{t} j + \sum_{j=1}^{t} a_j = \beta_0 t + \beta_1 t(t + 1)/2 + S_t$$
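The closed form under the null can be checked numerically. The sketch below is illustrative only: it assumes a zero starting value $x_0 = 0$, no lagged differences, NumPy, and hypothetical function names. It iterates the recursion with $\phi = 1$ and compares the result with $\beta_0 t + \beta_1 t(t+1)/2 + S_t$.

```python
import numpy as np

def recursion_under_null(a, beta0, beta1):
    """Iterate x_t = beta0 + beta1*t + x_{t-1} + a_t from x_0 = 0 (phi = 1)."""
    x = np.empty(len(a))
    prev = 0.0
    for i, shock in enumerate(a):
        prev = beta0 + beta1 * (i + 1) + prev + shock
        x[i] = prev
    return x

def closed_form_under_null(a, beta0, beta1):
    """beta0*t + beta1*t(t+1)/2 + S_t, with S_t the cumulated shocks."""
    t = np.arange(1, len(a) + 1)
    return beta0 * t + beta1 * t * (t + 1) / 2 + np.cumsum(a)

shocks = np.random.default_rng(42).standard_normal(250)
same = np.allclose(recursion_under_null(shocks, 0.1, 0.01),
                   closed_form_under_null(shocks, 0.1, 0.01))
print(same)   # True
```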


A quadratic trend might be thought unlikely, because, if, for example, $x_t$ is in
logarithms, a non-zero $\beta_1$ under the null would imply an ever-increasing (or
decreasing) rate of change $\Delta x_t$. Nevertheless, a parametric joint test of $\phi = 1$
and $\beta_1 = 0$ is given by the conventional Wald test, comparing the residual
sum of squares from the regression of (3.16) with the residual sum of squares
from the ‘restricted’ regression of $\Delta x_t$ on an intercept and $k$ lags of $\Delta x_t$.
Rather than having an F-distribution, the statistic has a non-standard
distribution whose critical values are given, for example, in Hamilton (1994,
table B.7 case 4) or Banerjee et al. (1993, table 4.5(c)). A non-parametric
version of this test is given in Perron (1988).

It is always possible, of course, that $\beta_1$ could be non-zero. If this were the
case, the quadratic trend in $x_t$ would dominate the I(1) component $S_t$, so that
the variance of $x_t$, when appropriately normalised, converges to a constant:
from Dolado, Jenkinson and Sosvilla-Rivero (1990), $T^{-5}\sum x_t^2 \Rightarrow \beta_1^2/20$.
As Banerjee et al. (1993, chap. 6.2.1) show, this implies that $\tau_\tau$, for example,
is asymptotically normal rather than converging to a non-standard
distribution. Similarly, if $\beta_1 = 0$ but $\beta_0 \neq 0$ in (3.15), $T^{-3}\sum x_t^2 \Rightarrow \beta_0^2/3$ and $\tau_\mu$ is
asymptotically normal.

To circumvent such complications, Dolado, Jenkinson and Sosvilla-Rivero
(1990) propose the following strategy for testing for unit roots in the
presence of possible trends. Equation (3.16) is first estimated and $\tau_\tau$ used to test
the null hypothesis that $\phi = 1$. If the null is rejected, there is no need to go
further and the testing procedure stops. If the null is not rejected, we test for
the significance of $\beta_1$ under the null, i.e. we estimate

$$\Delta x_t = \beta_0 + \beta_1 t + \sum_{i=1}^{k}\delta_i\Delta x_{t-i} + a_t$$

and test whether $\beta_1$ is zero or not using conventional testing procedures. If $\beta_1$
is significant, we compare $\tau_\tau$ with the standard normal and make our
inference on the null accordingly. If, on the other hand, $\beta_1$ is not significant,
we estimate (3.16) without the trend ($\beta_1 = 0$),

$$x_t = \beta_0 + \phi x_{t-1} + \sum_{i=1}^{k}\delta_i\Delta x_{t-i} + a_t$$

and test the unit root null of $\phi = 1$ using $\tau_\mu$. If the null is rejected, the testing
procedure is again terminated. If it is not rejected, we test for the significance
of the constant $\beta_0$ under the null using the regression

$$\Delta x_t = \beta_0 + \sum_{i=1}^{k}\delta_i\Delta x_{t-i} + a_t$$

If $\beta_0$ is insignificant, we conclude that $x_t$ contains a unit root, while, if $\beta_0 \neq 0$,
we compare $\tau_\mu$ with the standard normal, again making our inference
accordingly.
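The sequence of decisions just described can be written as a short function that takes the four statistics as given. This is a sketch under stated assumptions rather than the authors' procedure in code: the function name `djs_strategy` and its return strings are hypothetical, the default critical values are the large-sample 5 per cent values ($-3.41$ for $\tau_\tau$, $-2.86$ for $\tau_\mu$, $-1.645$ for a one-sided standard normal and 1.96 for the conventional two-sided t-test), and in practice sample-specific critical values should replace them.

```python
def djs_strategy(tau_tau, t_beta1, tau_mu, t_beta0,
                 cv_tau_tau=-3.41, cv_tau_mu=-2.86,
                 cv_normal=-1.645, cv_t=1.96):
    """Walk through the sequential unit root testing strategy.

    tau_tau : t-ratio on x_{t-1} in the regression with constant and trend
    t_beta1 : t-ratio on the trend in the regression imposing the unit root
    tau_mu  : t-ratio on x_{t-1} in the regression with constant only
    t_beta0 : t-ratio on the constant in the regression imposing the unit root
    """
    if tau_tau < cv_tau_tau:
        return "reject unit root (tau_tau against DF critical value)"
    if abs(t_beta1) > cv_t:
        # significant trend under the null: tau_tau is asymptotically normal
        if tau_tau < cv_normal:
            return "reject unit root (tau_tau against standard normal)"
        return "unit root with trend term not rejected"
    if tau_mu < cv_tau_mu:
        return "reject unit root (tau_mu against DF critical value)"
    if abs(t_beta0) > cv_t:
        # significant drift under the null: tau_mu is asymptotically normal
        if tau_mu < cv_normal:
            return "reject unit root (tau_mu against standard normal)"
        return "unit root with drift not rejected"
    return "driftless unit root not rejected"

# Statistics of the kind reported for UK equity prices in example 3.2
print(djs_strategy(tau_tau=-2.31, t_beta1=0.12, tau_mu=-0.69, t_beta0=2.35))
```

Keeping the decision logic separate from the estimation step makes it easy to substitute small-sample critical values, which matters given the caveats about the normal approximation.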

This procedure is, of course, based on the asymptotic normality of $\tau_\tau$ and
$\tau_\mu$ in the presence of a trend or drift in the relevant unit root null. An
interesting question is: what happens in small samples? Both Hylleberg and
Mizon (1989) and Schmidt (1990) present evidence that, when the drift
parameter $\beta_0$ is small compared to $\sigma^2$, the small sample distribution of $\tau_\mu$ is
very much closer to the Dickey-Fuller distribution than to the standard
normal. Schmidt (1990) and Banerjee et al. (1993) present tabulations of the
$\tau_\mu$ distribution for various values of the ‘standardised drift’ $\beta_0/\sigma$; it is only
when this ratio exceeds 0.5 that the critical values are closer to the normal
than to the DF distribution. We should, therefore, be careful when applying
the asymptotic normality result in the testing strategy outlined above.


3.4 Other approaches to testing for unit roots 


Many alternative unit root tests have been developed since the initial tests
outlined above, and surveys may be found in, for example, Park (1990), Stock
(1994), Maddala and Kim (1998) and Phillips and Xiao (1998). A recurring
theme is the low power and severe size distortion inherent in many tests (see,
for example, Haldrup and Jansson, 2006). For example, the non-parametric
$Z(\phi)$ and $Z(\tau_\mu)$ tests suffer severe size distortions when there are moving
average errors with a large negative root and, although the parametric
counterparts are more accurate in this respect, the problem is not negligible even
here (see Phillips and Perron, 1988, Schwert, 1989, and Perron and Ng, 1996).
Moreover, many tests have low power when the largest autoregressive root
is close to but nevertheless less than unity (DeJong et al., 1992).

A related issue when dealing with financial time series is that, unlike many 
hypothesis testing situations, the power of tests of the unit root hypothesis 
against stationary alternatives depends less on the number of observations 
per se and more on the span of the data. For a given number of observations, 
the power is largest when the span is longest. Conversely, for a given span, 
additional observations obtained using data sampled more frequently lead 
only to a marginal increase in power, the increase becoming negligible as 
the sampling interval is decreased: see Shiller and Perron (1985), Perron 
(1991) and Pierse and Snell (1995). Hence, a data set containing fewer 
annual observations over a long time period will lead to unit root tests 
having higher power than those computed from a data set containing 
more observations over a shorter time period. This is of some consequence 
when analysing financial time series, which often have a large number of 
observations obtained by sampling at very fine intervals over a fairly short 
time span. 

We thus focus here on recent tests that explicitly concentrate on
improving power and reducing size distortion. Elliott, Rothenberg and Stock
(1996; see also Stock, 1991, 1994) appeal to local-to-unity asymptotic
approximation theory, where the stationary alternative to a unit root is
expressed as $\phi = 1 + c/T$, to propose the DF-GLS (generalised least squares)
test, in which the data are detrended prior to running a Dickey-Fuller
regression such as (3.9) or (3.16). The test is computed in two steps. If we let
$z_t(0) = 1$ and $z_t(1) = (1, t)$, then the first step involves obtaining estimates of
$\beta(0) = \beta_0$ or $\beta(1) = (\beta_0, \beta_1)$ by regressing the quasi-differences

$$x^d(\bar{\phi}) = [x_1, x_2 - \bar{\phi}x_1, \ldots, x_T - \bar{\phi}x_{T-1}]^{\top}$$

where $\top$ denotes the vector/matrix transpose operator, on

$$z^d(\bar{\phi}, k) = [z_1(k), z_2(k) - \bar{\phi}z_1(k), \ldots, z_T(k) - \bar{\phi}z_{T-1}(k)]^{\top}$$

where $k = 0, 1$ and $\bar{\phi} = 1 + \bar{c}/T$. Denoting the resulting estimator $\tilde{\beta}(k)$, the
‘locally detrended’ series

$$x_t^d(\bar{\phi}, k) = x_t - z_t(k)\tilde{\beta}^{\top}(k)$$

is then computed. For the second step, the regression (3.16) is then run using
$x_t^d(\bar{\phi}, k)$ but without an intercept or time trend. Elliott, Rothenberg and
Stock (1996) suggest choosing $\bar{c} = -13.5$ if $k = 1$ and provide critical
values for the t-statistic on $x_{t-1}$. If no trend is to be included ($k = 0$) then
$\bar{c} = -7$ and the t-statistic has the $\tau$ distribution.
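The two steps can be sketched as follows. The function names `gls_detrend` and `dfgls_stat` are hypothetical and NumPy is assumed. The second step is written in the differenced form $\Delta x_t^d = (\phi - 1)x_{t-1}^d + \sum_i \delta_i \Delta x_{t-i}^d + a_t$, whose t-ratio on $x_{t-1}^d$ is the same statistic as that from the levels form. The values of $\bar{c}$ follow Elliott, Rothenberg and Stock (1996): $-13.5$ with a trend and $-7$ with a constant only.

```python
import numpy as np

def gls_detrend(x, k=1):
    """Step 1: locally detrend x (k=1: constant and trend, k=0: constant only)."""
    x = np.asarray(x, dtype=float)
    T = x.size
    c_bar = -13.5 if k == 1 else -7.0
    phi_bar = 1.0 + c_bar / T
    t = np.arange(1, T + 1, dtype=float)
    z = np.column_stack([np.ones(T), t]) if k == 1 else np.ones((T, 1))
    # quasi-differences: the first observation is kept in levels
    xq = np.concatenate([x[:1], x[1:] - phi_bar * x[:-1]])
    zq = np.vstack([z[:1], z[1:] - phi_bar * z[:-1]])
    beta, *_ = np.linalg.lstsq(zq, xq, rcond=None)
    return x - z @ beta

def dfgls_stat(x, k=1, lags=0):
    """Step 2: t-ratio on the lagged level, no intercept or trend."""
    xd = gls_detrend(x, k)
    dx = np.diff(xd)
    y = dx[lags:]
    cols = [xd[lags:-1]]                       # lagged level x^d_{t-1}
    for i in range(1, lags + 1):
        cols.append(dx[lags - i:-i])           # lagged differences
    X = np.column_stack(cols)
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ b
    s2 = resid @ resid / (y.size - X.shape[1])
    se = np.sqrt(s2 * np.linalg.inv(X.T @ X)[0, 0])
    return b[0] / se
```

A quick check of the first step is that an exact linear trend is removed completely, since its quasi-differences are an exact linear combination of the quasi-differenced regressors.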

Elliott, Rothenberg and Stock also propose a point optimal unit root test,
i.e. the most powerful test of a unit root against a simple point alternative. If
we define the sum of squared residuals from the first step regression as $S(\bar{\phi})$,
then the point optimal test of the null $\phi = 1$ against the alternative that
$\phi = \bar{\phi}$ is then defined as

$$P_\tau = \left(S(\bar{\phi}) - \bar{\phi}S(1)\right)/\hat{\sigma}_{S\ell}^2$$

where $\hat{\sigma}_{S\ell}^2$ is the estimator of $\sigma_S^2$ shown in (3.12). Critical values of $P_\tau$ are
provided by Elliott, Rothenberg and Stock, with the null of a unit root being
rejected if the calculated value of the test statistic is too small.

Ng and Perron (2001) construct four tests that are based upon the GLS
detrended data $x_t^d$. It is useful to define the term

$$\kappa = T^{-2}\sum_{t=2}^{T}\left(x_{t-1}^d(\bar{\phi}, k)\right)^2$$

whereupon the test statistics are defined as

$$MZ^d(\phi) = \left(T^{-1}\left(x_T^d(\bar{\phi}, k)\right)^2 - \hat{\sigma}_{S\ell}^2\right)/(2\kappa)$$

$$MSB^d = \left(\kappa/\hat{\sigma}_{S\ell}^2\right)^{1/2}$$

$$MZ_t^d = MZ^d(\phi) \times MSB^d$$

and

$$MP_T^d(k) = \begin{cases}\left(\bar{c}^2\kappa - \bar{c}\,T^{-1}\left(x_T^d(\bar{\phi}, k)\right)^2\right)/\hat{\sigma}_{S\ell}^2 & \text{if } k = 0\\ \left(\bar{c}^2\kappa + (1 - \bar{c})\,T^{-1}\left(x_T^d(\bar{\phi}, k)\right)^2\right)/\hat{\sigma}_{S\ell}^2 & \text{if } k = 1\end{cases}$$

$MZ^d(\phi)$ and $MZ_t^d$ originate as modified versions of the non-parametric
$Z(\phi)$ and $Z(\tau_\mu)$ tests. Perron and Ng (1996) define the ‘M-tests’
to incorporate the feature that a series converges in distribution with
different rates of normalisation under the null and alternative hypotheses. For
example, the M-modified $Z(\phi)$ test is

$$MZ(\phi) = Z(\phi) + (T/2)(\hat{\phi}_T - 1)^2$$

$MZ^d(\phi)$ is then defined using the GLS detrended data, with the $MP_T^d(k)$
statistics being similarly modified versions of the point optimal $P_\tau$ test. The
$MSB^d$ statistic is a modified version of Bhargava’s (1986) $R_1$ statistic, and
critical values for all these statistics are tabulated as table 1 of Ng and Perron
(2001). On the stationary alternative, $MSB^d$ tends to zero, so that the unit
root null is rejected when the statistic is below the critical value.
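Given a GLS detrended series and an estimate of the long-run variance, the four statistics are simple functions of two sums. The sketch below assumes NumPy and a hypothetical function name `m_tests`; the detrended series and the variance estimate are taken as inputs rather than computed, and the form used for the modified point optimal statistic is the one given by Ng and Perron (2001).

```python
import numpy as np

def m_tests(xd, lrv, c_bar, k):
    """Return (MZ, MSB, MZ_t, MP) for a detrended series xd.

    lrv   : estimate of the long-run variance (e.g. of the form (3.12))
    c_bar : local-to-unity constant used in the detrending step
    k     : 0 for constant only, 1 for constant and trend
    """
    xd = np.asarray(xd, dtype=float)
    T = xd.size
    kappa = np.sum(xd[:-1] ** 2) / T**2     # T^-2 * sum of squared lagged levels
    last = xd[-1] ** 2 / T                  # T^-1 * (x_T^d)^2
    mz = (last - lrv) / (2.0 * kappa)
    msb = np.sqrt(kappa / lrv)
    mzt = mz * msb
    if k == 0:
        mp = (c_bar**2 * kappa - c_bar * last) / lrv
    else:
        mp = (c_bar**2 * kappa + (1.0 - c_bar) * last) / lrv
    return mz, msb, mzt, mp
```

Because the third statistic is the product of the first two, reported values can be cross-checked by multiplication.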

All the non-parametric statistics have been defined using $\hat{\sigma}_{S\ell}^2$, given by
(3.12), which may be interpreted as an estimator of the residual spectral
density at frequency zero. An alternative estimator is given by

$$\hat{\sigma}_{S,AR}^2 = \hat{\sigma}_a^2/\left(1 - \hat{\delta}_1 - \cdots - \hat{\delta}_k\right)^2$$

where $\hat{\sigma}_a^2$ is a consistent estimate of the residual error variance from, say, the
Dickey-Fuller regression (3.16) and the $\hat{\delta}_i$ are the estimated lag coefficients.
This is often referred to as the autoregressive estimate of $\sigma_S^2$.

A related development is the set of confidence intervals for the largest
autoregressive root provided by Stock (1991). Stock assumes that $\phi = 1 + c/T$
and then uses the local-to-unity asymptotic distribution theory to
construct asymptotic confidence intervals for $\phi$ based on computed $\tau_\mu$ and
$\tau_\tau$ statistics. Since the distributions of these statistics are non-normal and the
dependence on $c$ is not a simple location shift, such confidence intervals
cannot be constructed using a simple rule such as ‘±2 standard errors’. The
intervals are highly non-linear, exhibiting a sharp ‘bend’ for $c$ just above zero
(see Stock, 1991, figs. 3.1, 3.2): for positive values of the test statistics the
intervals are tight, for large negative values they are wide.

Stock provides tables from which confidence intervals for $\phi$ can be
calculated given a value of $\tau_\mu$ or $\tau_\tau$ and the sample size $T$. As an illustration of
such a calculation, recall that the $\tau_\mu$ statistic for the UK interest rate spread
was reported in example 3.1 to be $-3.89$. From part A of table A.1 in Stock
(1991), such a value corresponds to an approximate 95 per cent confidence
interval for $c$ of $(-42.47, -10.70)$. Since the statistic was computed from a
sample size of $T = 646$, this corresponds to an interval for $\phi$ of
$(1 - 42.47/646, 1 - 10.70/646)$, i.e. $0.934 < \phi < 0.983$. Since $\hat{\phi} = 0.965$, this shows the
complicated nature of the relationship between $\hat{\phi}$ and the confidence interval
constructed by ‘inverting’ the $\tau_\mu$ statistic; the point estimate is not, and
generally will not be, at the centre of the interval. Nevertheless, unity is
excluded from the interval, thus confirming our choice of a stationary
process for modelling this series.
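The final step of this calculation, mapping an interval for $c$ into one for $\phi$ through $\phi = 1 + c/T$, is simple arithmetic. In the sketch below the helper name `phi_interval` is hypothetical and the interval for $c$ is taken as given from Stock's tables, not computed.

```python
def phi_interval(c_low, c_high, T):
    """Map a confidence interval for c into one for phi = 1 + c/T."""
    return 1.0 + c_low / T, 1.0 + c_high / T

low, high = phi_interval(-42.47, -10.70, 646)
print(round(low, 3), round(high, 3))   # 0.934 0.983
```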

Throughout this development of unit root testing procedures, the null
hypothesis has been that of a unit root, with a stationary hypothesis (either
trend or level stationarity) as the alternative. How might we go about testing
the null of stationarity against a unit root alternative? This has been
considered by Kwiatkowski et al. (1992) and Leybourne and McCabe (1994,
1999), and, in a related fashion, by Tanaka (1990) and Saikkonen and
Luukkonen (1993). Consider again the ARIMA(0,1,1) process

$$\Delta x_t = \theta_0 + a_t - \theta a_{t-1} \tag{3.17}$$

As was pointed out in chapter 2, a TS process is obtained if $\theta = 1$, so that this
restriction parameterises the trend stationary null, with the unit root
alternative being that $\theta < 1$. Equivalently, the null of $\theta = 1$ may be regarded as a
case of overdifferencing.

The statistic that has been proposed to test this null when the $a_t$ are
strict white noise and normal, often referred to as the KPSS test (after
Kwiatkowski, Phillips, Schmidt and Shin), is

$$\eta_\tau = T^{-2}\sum_{t=1}^{T}\hat{S}_t^2/\hat{\sigma}_e^2$$

Here $\hat{S}_t = \sum_{i=1}^{t} e_i$, $e_t = x_t - \hat{\beta}_0 - \hat{\beta}_1 t$ and

$$\hat{\sigma}_e^2 = T^{-1}\sum_{t=1}^{T} e_t^2$$

Kwiatkowski et al. (1992) show that the limiting distribution of $\eta_\tau$ is

$$\eta_\tau \Rightarrow \int_0^1 V_2(r)^2 dr$$

where $V_2(r)$ is a demeaned and detrended Brownian motion process, also
known as a second-level Brownian bridge, given by

$$V_2(r) = W(r) - (3r^2 - 2r)W(1) + (6r^2 - 6r)\int_0^1 W(s)ds$$

On the null of $\theta = 1$, $\eta_\tau = 0$, while, under the alternative, $\eta_\tau > 0$ (Kwiatkowski
et al., 1992, show that the test is consistent: a test is consistent if the probability
of rejecting a false null goes to one as $T \to \infty$). Upper tail critical values of
$\eta_\tau$ are reported in table 1 of Kwiatkowski et al.: the 5 per cent critical value is
0.146, while the 1 per cent value is 0.216.

If there is no trend in $x_t$ under the null then the residuals are defined as
$e_t = x_t - \bar{x}$. The level stationarity test statistic is then denoted as $\eta_\mu$, whose
limiting distribution is

$$\eta_\mu \Rightarrow \int_0^1 V(r)^2 dr$$

Here $V(r) = W(r) - rW(1)$ is a demeaned Brownian motion process, a Brownian
bridge. Upper tail critical values of $\eta_\mu$ are also reported in table 1 of Kwiatkowski
et al.: 5 per cent and 1 per cent critical values are 0.463 and 0.739, respectively.

Of course, restricting $a_t$ to be strict white noise will typically be
inappropriate, but extensions are readily available if we assume the weak
dependence conditions (3.11) and replace $\hat{\sigma}_e^2$ by an estimator, say of the form
of (3.12):

$$\hat{\sigma}_{e\ell}^2 = T^{-1}\sum_{t=1}^{T} e_t^2 + 2T^{-1}\sum_{j=1}^{\ell}\omega_j(\ell)\sum_{t=j+1}^{T} e_t e_{t-j}$$

These statistics, which we denote $\eta_\mu(\ell)$ and $\eta_\tau(\ell)$, have the same limiting
distributions as $\eta_\mu$ and $\eta_\tau$. Leybourne and McCabe’s (1994, 1999) variant of
this test corrects for any serial correlation by considering an ARIMA(p,1,1)
process rather than (3.17).
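A compact sketch of the two stationarity statistics follows. It assumes NumPy, a hypothetical function name `kpss_stat`, and the triangular weights $\omega_j(\ell) = 1 - j/(\ell + 1)$ for the variance estimator; setting `ell = 0` gives the simple white-noise versions.

```python
import numpy as np

def kpss_stat(x, trend=False, ell=0):
    """eta_mu (trend=False) or eta_tau (trend=True) with truncation lag ell."""
    x = np.asarray(x, dtype=float)
    T = x.size
    cols = [np.ones(T)]
    if trend:
        cols.append(np.arange(1, T + 1, dtype=float))
    Z = np.column_stack(cols)
    beta, *_ = np.linalg.lstsq(Z, x, rcond=None)
    e = x - Z @ beta                      # demeaned (and detrended) residuals
    S = np.cumsum(e)                      # partial sums S_t
    var = e @ e / T
    for j in range(1, ell + 1):
        w = 1.0 - j / (ell + 1.0)
        var += 2.0 * w * (e[j:] @ e[:-j]) / T
    return np.sum(S**2) / (T**2 * var)

# Small worked case: residuals (-2, -1, 0, 1, 2), partial sums (-2, -3, -3, -2, 0)
print(round(kpss_stat([1, 2, 3, 4, 5]), 2))   # 0.52
```

The statistic is compared with upper tail critical values, so large values count against the stationarity null.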
The model (3.17) can also be used to illustrate the important concept of
observational equivalence. Assuming $\theta_0 = 0$ for simplicity, the model can be
written as

$$x_t = a_t + (1 - \theta)S_{t-1}$$

For large $t$, $x_t$ will be dominated by the unit root component $(1 - \theta)S_{t-1}$. If $\theta$
is close to one, however, then in a finite sample $x_t$ will behave essentially like
the white-noise process $a_t$. In such circumstances, unit root tests will almost
always reject the unit root null, even though it is true! As Cochrane (1991)
points out, in general any TS process can be approximated arbitrarily well
by a unit root process, and vice versa, in the sense that the ACFs of the two
processes will be arbitrarily close.
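The decomposition can be verified directly by cumulating the differenced process. The sketch below assumes NumPy, zero starting values ($x_0 = 0$, $a_0 = 0$) and hypothetical function names; it also shows how small the unit root component is relative to the noise when $\theta$ is close to one.

```python
import numpy as np

def levels_from_differences(a, theta):
    """Cumulate Delta x_t = a_t - theta * a_{t-1}, starting from x_0 = 0, a_0 = 0."""
    a = np.asarray(a, dtype=float)
    lagged = np.concatenate([[0.0], a[:-1]])
    return np.cumsum(a - theta * lagged)

def noise_plus_random_walk(a, theta):
    """x_t = a_t + (1 - theta) * S_{t-1}, with S_t the cumulated shocks."""
    a = np.asarray(a, dtype=float)
    S_lag = np.concatenate([[0.0], np.cumsum(a)[:-1]])
    return a + (1.0 - theta) * S_lag

a = np.random.default_rng(5).standard_normal(500)
x = levels_from_differences(a, theta=0.98)
print(np.allclose(x, noise_plus_random_walk(a, theta=0.98)))   # True
# With theta = 0.98 the random walk component is scaled by only 0.02
print(np.std(x - a) / np.std(a))
```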

While attempting to discriminate between the two classes of processes in
such circumstances might well be regarded as impossible, distinguishing
between the two can yield important advantages. Campbell and Perron
(1991) argue that near-stationary integrated processes ($\theta$ close to one) seem
to be better forecast using stationary models ($\theta = 1$), while near-integrated
stationary models, such as

$$x_t = \phi x_{t-1} + a_t - \theta a_{t-1}$$

with $\phi$ close to, but less than, one, are better forecast using integrated models.
It may also be better to use integrated asymptotic theory to approximate finite
sample distributions for near-integrated stationary models, and stationary
asymptotic theory for near-stationary integrated models.

We should emphasise that all these testing procedures rely on classical 
methods of statistical inference. An alternative Bayesian methodology has 
also developed from the work of Sims (1988), Sims and Uhlig (1991), DeJong 
and Whiteman (1991a, 1991b) and Koop (1992). Special issues of the Journal 
of Applied Econometrics (1991, volume 6, number 4), Econometric Theory 
(1994, volume 10, number 4/5) and the Journal of Econometrics (1995, 
volume 69, number 1) are devoted to Bayesian developments in unit root 
testing, but it seems fair to say that, perhaps because of both the technical 
and computational complexity of the Bayesian methodology, the classical 
approach remains the most convenient for the applied practitioner to adopt. 


Example 3.2 Are UK equity prices trend or difference stationary?
In example 2.6 we modelled the logarithms of the UK FTA All Share index as
an ARIMA(2,1,2) process on noting that it had a pronounced tendency to
drift upwards, albeit with some major ‘wanderings’ about trend. We may
thus investigate whether this DS representation is appropriate or whether a
TS model would be preferable. Let us first test the null hypothesis that the
series contains a unit root against the alternative that it is generated as
stationary deviations about a linear trend.

Following the testing strategy outlined in section 3.1.5 requires
estimating the following regressions (with absolute t-ratios now shown in
parentheses):

(i) $\Delta x_t = \underset{(2.47)}{0.073} + \underset{(2.21)}{0.00016}\,t - \underset{(2.31)}{0.0210}\,x_{t-1} + \sum_{i=1}^{3}\hat{\delta}_i\Delta x_{t-i} + \hat{a}_t$

(ii) $\Delta x_t = \underset{(0.90)}{0.0072} - \underset{(0.12)}{0.000002}\,t + \sum_{i=1}^{3}\hat{\delta}_i\Delta x_{t-i} + \hat{a}_t$

(iii) $\Delta x_t = \underset{(1.10)}{0.0161} - \underset{(0.69)}{0.00156}\,x_{t-1} + \sum_{i=1}^{3}\hat{\delta}_i\Delta x_{t-i} + \hat{a}_t$

(iv) $\Delta x_t = \underset{(2.35)}{0.0063} + \sum_{i=1}^{3}\hat{\delta}_i\Delta x_{t-i} + \hat{a}_t$

From regression (i) a $\tau_\tau$ test cannot reject the DS null, while $\beta_1$ is found to be
insignificant under this null from regression (ii). This necessitates estimating
regression (iii), from which a $\tau_\mu$ test still cannot reject the null. Estimating
equation (iv) shows that $\beta_0$ is non-zero under the null, so that $\tau_\mu$ strictly
should be tested against a standard normal. Since $\tau_\mu = -0.69$, however, this
does not alter our conclusion that a unit root in $x_t$ cannot be rejected. Note
that the implied estimate of $\phi$ from regression (i) is 0.979, and, with $T = 488$,
the associated 95 per cent confidence interval, calculated using part B of
table A.1 of Stock (1991), is $0.999 < \phi < 1.011$.

The DF-GLS test statistic, computed using three lags, is $-2.13$, which is
again insignificant, the 10 per cent critical value being $-2.57$. The Elliott,
Rothenberg and Stock (1996) point optimal test is $P_\tau = 9.61$, which is
insignificant as the 10 per cent critical value is 6.89. The Ng and Perron (2001)
tests produce the following statistics: $MZ^d(\phi) = -9.52$, $MSB^d = 0.228$,
$MZ_t^d = -2.17$ and $MP_T^d(1) = 9.64$, all of which are insignificant as the
10 per cent critical values are $-14.2$, 0.185, $-2.62$ and 6.67, respectively. In
each of these, $\hat{\sigma}_{S\ell}^2$ was used as the estimate of the zero-frequency spectral
density with $\ell = 3$. Varying the setting of $\ell$ and the weight function $\omega_j(\ell)$ did
not alter these inferences and nor did using the alternative estimator $\hat{\sigma}_{S,AR}^2$.

If we assume the null of trend stationarity, we find $\eta_\tau(17) = 0.248$, which,
as the 1 per cent critical value is 0.216, represents a clear rejection and thus
confirms that equity prices do follow an I(1) process with drift, so that the
model estimated in example 2.6 is indeed the appropriate one.


3.5 Testing for more than one unit root 


The above development of unit root tests has been predicated on the
assumption that $x_t$ contains at most one unit root, i.e. that it is at most I(1).
If the null hypothesis of a unit root is not rejected, then it may be necessary to
test whether the series contains a second unit root, i.e. whether it is I(2) and
thus needs differencing twice to induce stationarity. Unfortunately, the
‘standard’ testing procedure on non-rejection of a unit root in the levels $x_t$,
that of testing whether the differences $\Delta x_t$ contain a unit root, is not justified
theoretically, as DF-type tests are based on the assumption of at most one
unit root. If the true number of unit roots is greater than one, the empirical
size of such tests is greater than the nominal size, so the probability of finding
any, let alone all, unit roots is reduced.

Dickey and Pantula (1987) propose a sequence of tests that does have a
theoretical justification when we assume that $x_t$ may contain more than one
unit root. For example, suppose we assume that $x_t$ contains a maximum of
two unit roots. To test the null hypothesis of two unit roots against the
alternative of one, we compare the t-ratio on $\beta_2$ from the regression

$$\Delta^2 x_t = \beta_0 + \beta_2\Delta x_{t-1} + a_t$$

with the $\tau_\mu$ critical values. If the null is rejected, we may then test the
hypothesis of exactly one unit root against the alternative of none by
comparing with $\tau_\mu$ the t-ratio on $\beta_1$ from

$$\Delta^2 x_t = \beta_0 + \beta_1 x_{t-1} + \beta_2\Delta x_{t-1} + a_t$$
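The two regressions in the sequence can be run with ordinary least squares. The following is a minimal sketch, assuming NumPy and hypothetical function names (`ols_t_ratios`, `dickey_pantula`); it returns the two t-ratios, which then have to be compared with $\tau_\mu$ critical values supplied by the user.

```python
import numpy as np

def ols_t_ratios(y, X):
    """Conventional OLS t-ratios for each column of X."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ b
    s2 = resid @ resid / (len(y) - X.shape[1])
    se = np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))
    return b / se

def dickey_pantula(x):
    """t-ratios for (two roots vs one) and (one root vs none)."""
    x = np.asarray(x, dtype=float)
    dx = np.diff(x)                  # first differences
    d2x = np.diff(dx)                # second differences (dependent variable)
    const = np.ones(d2x.size)
    dx_lag = dx[:-1]                 # Delta x_{t-1}
    x_lag = x[1:-1]                  # x_{t-1}
    # Step 1: regress on a constant and Delta x_{t-1}
    t_two_roots = ols_t_ratios(d2x, np.column_stack([const, dx_lag]))[1]
    # Step 2: add x_{t-1}; the t-ratio on x_{t-1} tests a single unit root
    t_one_root = ols_t_ratios(d2x, np.column_stack([const, x_lag, dx_lag]))[1]
    return t_two_roots, t_one_root

# A simulated random walk has exactly one unit root, so the first statistic
# should be strongly negative
walk = np.cumsum(np.random.default_rng(0).standard_normal(500))
print(dickey_pantula(walk))
```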


Example 3.3 Do UK interest rates contain two unit roots?
Figure 3.6 shows plots of the UK short and long interest rates from which the
spread, analysed in example 3.1, was calculated. To test for the presence of at
most two unit roots we first estimate, under the null hypothesis of exactly
two unit roots, the regressions

$$\Delta^2 RS_t = \underset{(0.019)}{0.004} - \underset{(0.038)}{0.712}\,\Delta RS_{t-1}$$

and

$$\Delta^2 R20_t = \underset{(0.012)}{0.0002} - \underset{(0.038)}{0.721}\,\Delta R20_{t-1}$$

where $RS_t$ and $R20_t$ are the short and long rates, respectively, and standard
errors are shown in parentheses. The $\tau_\mu$ statistics are computed to be $-18.74$
and $-18.97$, respectively, thus conclusively rejecting the hypothesis of two
unit roots in both series. On estimating the regressions

$$\Delta^2 RS_t = \underset{(0.043)}{0.112} - \underset{(0.005)}{0.015}\,RS_{t-1} - \underset{(0.038)}{0.706}\,\Delta RS_{t-1}$$

and

$$\Delta^2 R20_t = \underset{(0.032)}{0.047} - \underset{(0.004)}{0.006}\,R20_{t-1} - \underset{(0.038)}{0.718}\,\Delta R20_{t-1}$$

however, we find that the $\tau_\mu$ statistics from the estimates of $\beta_1$ are $-2.76$ and
$-1.59$, thus providing no evidence against the hypothesis that both series
contain a single unit root.


3.6 Segmented trends, structural breaks and smooth transitions


The difference stationary null hypothesis that has so far been considered is
that the observed series $\{x_t\}_0^T$ is a realisation from a process characterised by
the presence of a unit root and possibly a non-zero drift. Perron (1989) has
generalised the approach so as to allow for a one-time change in the structure
occurring at a known time $T_B$ ($1 < T_B < T$). Subsequently, the testing
procedure was extended to the situation where the break point $T_B$ is
unknown and must be chosen prior to testing for a unit root, i.e. the break
is determined endogenously. Further issues that must be considered are
whether the break in structure is assumed to occur instantly or whether it
occurs more slowly over time, and whether the break takes place under both
the null and the alternative or under just the latter.

The approach that we take here follows that of Vogelsang and Perron 
(1998), who may be consulted for references to the earlier literature on 
testing for unit roots when the break is endogenous. The case when the break 
is assumed to occur instantly is known as the additive outlier (AO) model, 
and three alternatives are considered: model (A), which allows a shift in the 
intercept of the trend function (the ‘crash’ model); model (B), which allows 
both a shift in intercept and slope (the ‘crash/changing growth’ model); and 
model (C), which allows a ‘smooth’ shift in the slope by requiring the end 
points of the two segments of the broken trend to be joined. The models
are parameterised as follows, where $T_B^c$ now denotes the ‘correct’ break date.


Null hypotheses

Model (A): $x_t = \beta + x_{t-1} + \theta DTB_t^c + e_t^*$
Model (B): $x_t = \beta + x_{t-1} + \theta DTB_t^c + \gamma DU_t^c + e_t^*$
Model (C): $x_t = \beta + x_{t-1} + \gamma DU_t^c + e_t^*$


Alternative hypotheses

Model (A): $x_t = \mu + \beta t + \theta DU_t^c + e_t$
Model (B): $x_t = \mu + \beta t + \theta DU_t^c + \gamma DT_t^c + e_t$
Model (C): $x_t = \mu + \beta t + \gamma DT_t^c + e_t$


The errors $e_t$ and $e_t^*$ can be assumed to be stationary processes, say
$e_t = \psi(B)a_t$ and $e_t^* = \psi^*(B)a_t$, where $\psi^*(B) = (1 - B)\psi(B)$.
The dummy variables are defined as

$DTB_t^c = 1$ if $t = T_B^c + 1$, 0 otherwise
$DU_t^c = 1$ if $t > T_B^c$, 0 otherwise
$DT_t^c = t - T_B^c$ if $t > T_B^c$, 0 otherwise


Note that the dummy variables are linked through
$DTB_t^c = \Delta DU_t^c = \Delta^2 DT_t^c$. Model (A) characterises the ‘crash’
in the null hypothesis by a dummy variable that takes the value one at the time
of the break, so that the level of $x$ changes by $\theta$ at $T_B^c + 1$. Under
the alternative of a trend stationary system, model (A) allows for a one-time
change in the intercept of the trend function, the magnitude of this change
being $\theta$. Model (B), the ‘changing growth’ model, specifies under the null
that the drift parameter changes from $\beta$ to $\beta + \theta + \gamma$ at
time $T_B^c + 1$ and then to $\beta + \gamma$ afterwards. Under the alternative,
the intercept changes by $\theta$ and the slope changes by $\gamma$ at
$T_B^c + 1$. Model (C) requires both segments of the trend under the alternative
to be equal at $T_B^c$.
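
The link between the three dummy variables can be checked numerically. The following is a minimal sketch, assuming observations indexed t = 1, ..., T and a break date of at least one; the function names `break_dummies` and `diff` are illustrative and do not come from the text.

```python
def break_dummies(n_obs, t_break):
    """Return (DTB, DU, DT) for t = 1..n_obs, given a break date t_break >= 1."""
    times = range(1, n_obs + 1)
    dtb = [1 if t == t_break + 1 else 0 for t in times]    # pulse at T_B + 1
    du = [1 if t > t_break else 0 for t in times]          # level shift after T_B
    dt = [t - t_break if t > t_break else 0 for t in times]  # trend shift after T_B
    return dtb, du, dt


def diff(series):
    """First difference, treating the (unobserved) pre-sample value as zero."""
    return [series[0]] + [b - a for a, b in zip(series, series[1:])]


dtb, du, dt = break_dummies(n_obs=10, t_break=4)
print(diff(du) == dtb)          # DTB is the first difference of DU
print(diff(diff(dt)) == dtb)    # and the second difference of DT
```

Treating the pre-sample value as zero is harmless here because all three dummies are zero at the start of the sample whenever the break date is at least one.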

Perron (1989) shows that standard tests of the unit root hypothesis are not 
consistent against trend stationary alternatives where the trend function 
contains a shift in the slope. Although such tests are consistent against a shift 
in the intercept of the trend function (if the change is fixed as T increases), 
their power is likely to be substantially reduced due to the fact that the 
limiting value of the estimate of the largest autoregressive root is inflated 
above its true value. (Leybourne, Mills and Newbold, 1998, investigate a 
converse phenomenon: standard tests can spuriously reject the unit root null 
if the break occurs early in the series.) 

Testing for a unit root in the AO framework consists of a four-step procedure.
The first step calculates the ‘detrended’ series $\tilde{x}_t^j$, whose form
depends on which model is being considered. Thus, we let $\tilde{x}_t^j$,
$j = A, B, C$, be the residuals from a regression of $x_t$ on:

(1) $j = A$: a constant, a time trend and $DU_t$;

(2) $j = B$: a constant, a time trend, $DU_t$ and $DT_t$; and

(3) $j = C$: a constant, a time trend and $DT_t$.


Note the change in notation: $DU_t$ and $DT_t$ are defined analogously to
$DU_t^c$ and $DT_t^c$, but for any break date $T_B$.

In the second step, the unit root hypothesis is tested using the $t$-statistic
for $\phi = 1$ in the regressions


$$\tilde{x}_t^j = \sum_{i=0}^{k} \omega_i DTB_{t-i} + \phi\tilde{x}_{t-1}^j + \sum_{i=1}^{k} c_i \Delta\tilde{x}_{t-i}^j + u_t, \qquad j = A, B \qquad (3.17)$$

and

$$\tilde{x}_t^C = \phi\tilde{x}_{t-1}^C + \sum_{i=1}^{k} c_i \Delta\tilde{x}_{t-i}^C + u_t$$


where $DTB_t$ is defined analogously to $DTB_t^c$. The $k + 1$ dummies
$DTB_t, \ldots, DTB_{t-k}$ are included in (3.17) to ensure that the limiting
distributions of the $t$-statistics on $\phi$ are invariant to the correlation
structure of the errors, although they are not needed in model C as in this
case the distribution of the $t$-statistic is asymptotically invariant to the
correlation structure.

The third step is to compute the set of $t$-statistics for all possible breaks
and to select the date for which the statistic is minimised. Thus, if the set of
$t$-statistics is denoted $t(j, AO, T_B)$, then we define $\hat{T}_B$ as the
break date for which


$$t(j, AO, \hat{T}_B) = \inf_{T_B \in \mathcal{T}} t(j, AO, T_B), \qquad \mathcal{T} = [1, T]$$


This choice of $\hat{T}_B$ thus corresponds to choosing the break date that is
most likely to reject the unit root hypothesis. Other possibilities are
available for choosing $\hat{T}_B$, however. For example, it might be selected as
the value that maximises the significance of one or both of the break
parameters $\theta$ and $\gamma$: Vogelsang and Perron (1998) discuss some of
these alternatives. The fourth step is then to compare $t(j, AO, \hat{T}_B)$ with
its critical value taken from the appropriate limiting distribution.

One drawback of these models is that they imply that the change in the
trend function occurs instantaneously. Perron (1989) generalises the models
by assuming that $x_t$ responds to a shock in the trend function in the same way
that it reacts to any other shock: this is known as the innovational outlier (IO)
model. This assumption can be captured using the following specifications.


Null hypotheses

Model (A): $x_t = \beta + x_{t-1} + \psi^*(B)(\theta DTB_t^c + a_t)$
Model (B): $x_t = \beta + x_{t-1} + \psi^*(B)(\theta DTB_t^c + \gamma DU_t^c + a_t)$
Model (C): $x_t = \beta + x_{t-1} + \psi^*(B)(\gamma DU_t^c + a_t)$


Alternative hypotheses

Model (A): $x_t = \mu + \beta t + \psi(B)(\theta DU_t^c + a_t)$
Model (B): $x_t = \mu + \beta t + \psi(B)(\theta DU_t^c + \gamma DT_t^c + a_t)$
Model (C): $x_t = \mu + \beta t + \psi(B)(\gamma DT_t^c + a_t)$


Thus, the immediate impact of a shift in slope for model (C), say, is $\gamma$,
while the long-run impact is either $\psi^*(1)\gamma$ or $\psi(1)\gamma$. Model
(B) can be tested by estimating the regression


$$x_t = \mu + \beta t + d\,DTB_t + \theta DU_t + \gamma DT_t + \phi x_{t-1} + \sum_{i=1}^{k} c_i \Delta x_{t-i} + e_t \qquad (3.18)$$


Under the null hypothesis of a unit root, $\phi = 1$ and
$\theta = \beta = \gamma = 0$, whereas, under the alternative of a ‘segmented
trend stationary’ process, $\phi < 1$ and $\theta, \beta, \gamma \neq 0$. The
null hypothesis can again be tested using the $t$-statistic for $\phi = 1$
from (3.18); the set of such statistics for all possible breaks can be denoted
$t(j, IO, T_B)$ and an analogous procedure to that discussed for the AO model
may then be followed. For model (A), the same regression is estimated but
with $\gamma = 0$, i.e. without $DT_t$; for model (C), $DTB_t$ and $DU_t$ are
deleted ($d = \theta = 0$). The case where there is no trend in the model is
considered in Perron (1990).

The asymptotic distributions of $t(j, AO, \hat{T}_B)$ and $t(j, IO, \hat{T}_B)$
depend on whether or not a break occurs under the null of a unit root, i.e. on
whether $\theta = \gamma = 0$ under the null. When no break has occurred, so
that we do have $\theta = \gamma = 0$, the asymptotic distributions are
identical for both $j = A, B$, and critical values are given in Vogelsang and
Perron (1998, tables 1, 2). These tables also give finite sample critical
values for alternative ways of selecting the lag order $k$. For $j = C$ the
asymptotic distributions differ, and critical values are given in Perron and
Vogelsang (1993).

When a break does occur under the null, in particular when $\gamma$ is non-zero,
so that there is a break in the drift, several difficulties can arise. Vogelsang
and Perron (1998) show that, in this case, the sizes of the above testing
procedures approach one as the sample size $T$ grows, so that for very large
samples spurious rejections of the null become almost certain. This is not
mitigated by using other approaches to select the break date. Harvey,
Leybourne and Newbold (2001), however, propose a modification that looks
to have good size and power properties in the IO case. This is to select
$\hat{T}_B$ as one plus the date selected by maximising the absolute values of
the $t$-statistics on $\theta$ in the sequence of regressions of (3.18), with
critical values of the resulting test of $\phi = 1$ being given in their table 2.

More recent work has concentrated on extending these techniques to
situations of multiple structural breaks (Bai, 1997; Bai and Perron, 1998) and
of general, and possibly non-stationary, serial correlation in the errors
(Perron, 1997; Vogelsang, 1997); for a detailed survey, see Perron (2006).
There is also a parallel literature developing on incorporating endogenously
determined breaks into the stationarity test discussed in section 3.4; see,
for example, Busetti and Harvey (2001, 2003) and Harvey and Mills (2003,
2004).

Rather than including lags of $\Delta x_t$, as in (3.18), to relax the
instantaneous impact of the break, an alternative is to allow the trend to
change gradually and smoothly between the two regimes. One possibility is to
use a logistic smooth transition regression (LSTR) model. Leybourne, Newbold
and Vougas (1998) propose the following three LSTR models to replace the
segmented trend alternative hypotheses introduced above:


Model (A): $x_t = \mu_1 + \mu_2 S_t(\gamma, m) + e_t$
Model (B): $x_t = \mu_1 + \beta_1 t + \mu_2 S_t(\gamma, m) + e_t$
Model (C): $x_t = \mu_1 + \beta_1 t + \mu_2 S_t(\gamma, m) + \beta_2 t S_t(\gamma, m) + e_t$


where $S_t(\gamma, m)$ is the logistic smooth transition function

$$S_t(\gamma, m) = (1 + \exp(-\gamma(t - mT)))^{-1}$$


which controls the transition between regimes.

The parameter $m$ determines the timing of the transition midpoint since,
for $\gamma > 0$, $S_{-\infty}(\gamma, m) = 0$, $S_{+\infty}(\gamma, m) = 1$ and
$S_{mT}(\gamma, m) = 0.5$. The speed of transition is determined by the
parameter $\gamma$. If $\gamma$ is small then $S_t(\gamma, m)$ takes a long
time to traverse the interval (0,1) and, in the limiting case when
$\gamma = 0$, $S_t(\gamma, m) = 0.5$ for all $t$. For large values of $\gamma$,
$S_t(\gamma, m)$ traverses the interval (0,1) very rapidly, and as $\gamma$
approaches $+\infty$ it changes from zero to one instantaneously at time $mT$.
Thus, in model A, $x_t$ is stationary around a mean that changes from an
initial value of $\mu_1$ to a final value of $\mu_1 + \mu_2$. Model B is
similar, with the intercept changing from $\mu_1$ to $\mu_1 + \mu_2$, but allows
for a fixed slope. In model C, in addition to the change in intercept from
$\mu_1$ to $\mu_1 + \mu_2$, the slope also changes, with the same speed of
transition, from $\beta_1$ to $\beta_1 + \beta_2$. If we allow $\gamma < 0$ then
the initial and final model states are reversed, but the interpretation of the
parameters remains the same.
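
The behaviour of the transition function at its limits can be illustrated with a short sketch. This is only an illustration under stated assumptions: the function names `logistic_transition` and `lstr_trend_c` are hypothetical, and the overflow guard is an implementation detail rather than part of the model.

```python
import math


def logistic_transition(t, gamma, m, n_obs):
    """S_t(gamma, m) = 1 / (1 + exp(-gamma * (t - m * n_obs)))."""
    z = -gamma * (t - m * n_obs)
    if z > 700.0:          # exp(z) would overflow; the limit of S_t is zero
        return 0.0
    return 1.0 / (1.0 + math.exp(z))


def lstr_trend_c(t, mu1, beta1, mu2, beta2, gamma, m, n_obs):
    """Deterministic part of LSTR model (C): intercept and slope both shift."""
    s = logistic_transition(t, gamma, m, n_obs)
    return mu1 + beta1 * t + mu2 * s + beta2 * t * s


# At the midpoint t = mT the function equals 0.5 whatever the speed gamma.
print(logistic_transition(50, gamma=2.0, m=0.5, n_obs=100))
# With gamma = 0 the function is 0.5 for every t.
print(logistic_transition(10, gamma=0.0, m=0.5, n_obs=100))
```

Setting `mu2 = beta2 = 0` in `lstr_trend_c` recovers an ordinary linear trend, which is a convenient sanity check when experimenting with parameter values.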

The smooth transition $S_t(\gamma, m)$ does impose certain restrictions, in that
the transition path is monotonic and symmetric around the midpoint. More
flexible specifications, which allow for non-monotonic and asymmetric
transition paths, could be obtained by including a higher-order time
polynomial in the exponential term of $S_t(\gamma, m)$. The constraints that the
transitions in intercept and slope occur only once, simultaneously and at the
same speed could also be relaxed, although at some cost to interpretation and
ease of estimation. For example, Sollis, Leybourne and Newbold (1999)
introduce asymmetry into the transition function, while Harvey and Mills
(2002) consider double smooth transition functions.

Leybourne, Newbold and Vougas (1998) investigate the question of
distinguishing between LSTR models and models containing unit roots,
analogous to the analysis developed above with segmented trends. Their
proposal is simple: estimate the models by non-linear least squares (NLS)
(i.e. detrend $x_t$), obtain the residuals, and compute an ADF test using these
residuals. Once again, standard DF percentiles are invalid, and Leybourne,
Newbold and Vougas provide the necessary critical values, which depend
upon which LSTR model is fitted. If the null hypothesis is an I(1) process
without drift then all three models are possible alternatives, while if the null
is an I(1) with drift only models B and C can be realistic alternatives.


Example 3.4 Unit roots and structural breaks in US stock prices

Figure 3.7 plots the logarithms of the nominal annual (January average) S&P
stock index for the period 1871 to 2006. A conventional unit root test
obtained the value of $\tau_\tau = -1.48$, which, since
$\tau_{\tau,0.10} = -3.15$, provides no evidence to reject the null hypothesis
that stock prices are DS in favour of the alternative that they are TS.


Figure 3.7 Logarithms of the nominal S&P 500 index (1871-2006) with a smooth transition trend superimposed


Following Perron (1989), however, we first consider the possibility of both
a change in level and, thereafter, a changed trend rate of growth of the series
after a break at an unknown point in time. We thus consider first the AO
model (B) with $k = 2$, the value selected for the conventional unit root test
reported above. This led to an estimated break point of $\hat{T}_B = 1928$,
consistent with the Great Crash of 1929 producing a downward level shift with,
perhaps, a subsequent increased trend rate of growth. The $t$-statistic for
testing $\phi = 1$ at this break, however, was $-4.52$, which is not significant
at the 5 per cent level using table 2 of Vogelsang and Perron (1998).

Alternatively, using the IO version yields a break point of 1930 but, again,
a $t$-statistic (of $-2.45$) that is insignificant. It would therefore appear
that the stock price series, although an I(1) process, undergoes a break in
level and drift after the Great Crash. Allowing for such a break under the null,
in the fashion of Harvey, Leybourne and Newbold (2001), selects
$\hat{T}_B = 1931$ but still provides no evidence against a unit root. Indeed,
the fitted model under the null is


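$$\Delta x_t = \underset{(0.016)}{0.068}\,DU_t - \underset{(0.162)}{0.284}\,DTB_t - \underset{(0.160)}{0.69}\,DTB_{t-1} + e_t$$


$$e_t = -\underset{(0.088)}{0.183}\,e_{t-2} + a_t, \qquad \hat{\sigma}_a = 0.1619$$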


The level of $x_t$ thus drops by 0.28 in 1932 and by a further 0.69 in 1933,
while the drift in the series is zero before 1931 and 6.9 per cent per annum
afterwards.

While the unit root hypothesis cannot be rejected in favour of a crash/ 
changing trend alternative, it is nevertheless worth investigating the possi- 
bility that the trend function could be modelled as a smooth transition. Since 
it seems sensible to allow for a transition in both intercept and trend, LSTR 
model C was estimated by NLS, yielding 


$$x_t = \underset{(0.061)}{1.388} + \underset{(0.0014)}{0.0171}\,t - \underset{(0.268)}{3.633}\,S_t(1.679, 0.585) + \underset{(0.0027)}{0.0530}\,t\,S_t(1.679, 0.585) + u_t$$


This model can be interpreted as implying that the intercept decreased from
1.388 to $-2.249$, while trend growth increased from 1.71 per cent per
annum to 7.01 per cent. The midpoint of the smooth transition is estimated
to be 1951, and, as $\hat{\gamma} = 1.679$, the speed of the transition is fairly
quick. As can be seen from the smooth transition trend also shown in figure
3.7, the transition takes about six years to complete. A unit root test computed
using the residuals from this model yielded a value of $-4.51$. From table 1
of Leybourne, Newbold and Vougas (1998), this is close to being significant
at the 10 per cent level, and the residuals are well fitted by the AR(2) process

$$u_t = \underset{(0.085)}{0.935}\,u_{t-1} - \underset{(0.085)}{0.187}\,u_{t-2} + e_t, \qquad \hat{\sigma}_e = 0.1613$$

which has roots of 0.65 and 0.29 and a standard error only a little larger than
that obtained from the unit root model. This finding that stock prices can
reasonably be modelled as the sum of a deterministic trend and a stationary
innovation is, as Perron (1989, p. 1384) remarks, ‘particularly striking given
the vast amount of theoretical and empirical studies supporting the random
walk hypothesis in this situation’.
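
Two of the figures quoted in this example can be reproduced with simple arithmetic. The sketch below is illustrative only: the helper names are hypothetical, and the convention that a transition is ‘complete’ once the logistic function has moved from 0.01 to 0.99 is an assumption made here, not one stated in the text.

```python
import cmath
import math


def transition_duration(gamma, tail=0.01):
    """Time for the logistic function to move from `tail` to 1 - `tail`.

    Solving S_t = tail and S_t = 1 - tail for t gives a distance of
    2 * log((1 - tail) / tail) / |gamma| between the two dates.
    """
    return 2.0 * math.log((1.0 - tail) / tail) / abs(gamma)


def ar2_roots(phi1, phi2):
    """Roots of z^2 - phi1 * z - phi2 = 0 for u_t = phi1 u_{t-1} + phi2 u_{t-2} + e_t."""
    disc = cmath.sqrt(phi1 * phi1 + 4.0 * phi2)
    return (phi1 + disc) / 2.0, (phi1 - disc) / 2.0


years = transition_duration(1.679)        # between five and six years
large, small = ar2_roots(0.935, -0.187)   # both real and inside the unit circle
print(round(years, 1), round(large.real, 2), round(small.real, 2))
```

Under the 1 per cent convention the transition lasts a little under six years, which is in line with the description of the fitted trend, and both AR(2) roots are real and well inside the unit circle.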


3.7 Stochastic unit root processes 


Granger and Swanson (1997), McCabe and Tremayne (1995) and Leybourne, 
McCabe and Tremayne (1996) investigate an extension of the I(1) model to 
the case where the process has a stochastic unit root (STUR). A simple 
example is the random coefficient AR(1) process 


$$x_t = \phi_t x_{t-1} + a_t, \qquad \phi_t = 1 + \delta_t \qquad (3.20)$$


where $a_t$ and $\delta_t$ are independent zero-mean strict white-noise
processes with variances $\sigma_a^2$ and $\sigma_\delta^2$. If
$\sigma_\delta^2 = 0$ then, clearly, $\phi_t = 1$ for all $t$, whereas, if
$\sigma_\delta^2 > 0$, $\phi_t$ has a mean of unity and $x_t$ hence has a
stochastic unit root. Why should such models be of interest here? Suppose that
$x_t$ is the price of a financial asset, which then has a time $t$ expected
return of


$$E(r_t) = \frac{E(x_t) - x_{t-1}}{x_{t-1}}$$


if any dividend payments are ignored for simplicity. Rearranging yields

$$E(x_t) = (1 + E(r_t))x_{t-1}$$


Setting $a_t = x_t - E(x_t)$ and defining $\delta_t = E(r_t)$ thus yields the
STUR(1) model (3.20). The assumption that $a_t$ is white noise, i.e. returns are
white noise, thus implies that the price levels have a stochastic unit root.

The STUR model can be extended in various ways, apart from considering
higher-order processes with possible trends. Granger and Swanson (1997)
consider (3.20) but with


$$\phi_t = \exp(\alpha_t)$$

where $\alpha_t$ is a zero-mean stationary stochastic process. Since we can write

$$\phi_t = (x_t/x_{t-1})(1 - a_t/x_t)$$


we have


$$\alpha_t = \Delta\log(x_t) + \log(1 - a_t/x_t) \approx \Delta\log(x_t) - a_t/x_t$$


again showing that, while $\log(x_t)$ has an exact unit root, $x_t$ has a
stochastic unit root. Leybourne, McCabe and Mills (1996) allow $\delta_t$ to be
an AR(1) process that may itself contain a unit root, so the transition between
stationary ($\phi_t < 1$) and explosive ($\phi_t > 1$) regimes does not
necessarily occur randomly but is allowed to evolve in a more gradual fashion.
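
A simulated path makes the STUR(1) mechanism in (3.20) concrete. The following is a minimal sketch under stated assumptions: Gaussian innovations, an arbitrary starting value and the function name `simulate_stur` are all choices made for illustration, not part of the model as described.

```python
import random


def simulate_stur(n_obs, sigma_a, sigma_delta, x0=100.0, seed=0):
    """Simulate x_t = (1 + delta_t) * x_{t-1} + a_t with Gaussian white noise.

    delta_t ~ N(0, sigma_delta^2) and a_t ~ N(0, sigma_a^2) are drawn
    independently; sigma_delta = 0 gives an ordinary random walk.
    """
    rng = random.Random(seed)
    path = [x0]
    for _ in range(n_obs):
        delta = rng.gauss(0.0, sigma_delta)   # random deviation of the root from one
        shock = rng.gauss(0.0, sigma_a)       # additive innovation a_t
        path.append((1.0 + delta) * path[-1] + shock)
    return path


random_walk = simulate_stur(250, sigma_a=1.0, sigma_delta=0.0)
stur = simulate_stur(250, sigma_a=1.0, sigma_delta=0.01)
print(len(random_walk), len(stur))
```

Because the same seed is used, the two paths share their random draws and differ only through the randomness of the root, which makes it easy to see how a small `sigma_delta` alters the trajectory.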

Granger and Swanson (1997) show that standard DF tests usually have 
little power to distinguish a STUR model from a standard unit root process, 
but Leybourne, McCabe and Tremayne (1996) and Leybourne, McCabe and 
Mills (1996) provide alternative tests that perform much better. 

These papers also discuss maximum likelihood (ML) estimation and 
forecasting of STUR models and provide some evidence in favour of such 
models over standard unit root models for both the levels and logarithms of 
a variety of stock market indices and interest rates. For example, Leybourne, 
McCabe and Mills (1996) find that the following STUR model fits the 
daily levels of the London Stock Exchange FTSE 350 index over the period 
1 January 1986 to 28 November 1994: 


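$$\Delta x_t = \beta + \phi_1\Delta x_{t-1} + \phi_4\Delta x_{t-4} + \delta_t(x_{t-1} - \beta(t - 1) - \phi_1 x_{t-2} - \phi_4 x_{t-5}) + a_t$$
$$\delta_t = \delta_{t-1} + \eta_t$$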


Here the level of the index, $x_t$, follows a (restricted) STUR(4) process with
drift in which the stochastic unit root is itself a random walk. The root
$\phi_t = 1 + \delta_t$ fluctuates in a narrow band around 0.96, and hence is
stationary. This model is found to produce a 7 per cent reduction in error
variance over the competing ARIMA(4,1,0) model and some evidence of a better
forecasting performance.

The list of non-stationary processes described in this chapter is by no
means exhaustive. Rather than allowing a one-time change or smooth
transition in the trend parameters, a more general approach is provided by
structural time series models that allow the parameters of the time trend to
vary continuously, such as in a random walk fashion; these models are
discussed in chapter 4. Granger, Inoue and Morin (1997) have proposed a
general class of non-linearly drifting random walks, called non-linear
stochastic trend processes. These are based on the so-called ‘growth processes’
and have increments that are equal to a positive function of the lagged series
plus an innovation whose variance depends on the lagged series. The
deterministic trends in these series are smooth parametric (e.g. sub-linear,
sub-exponential, exponential or super-exponential) or non-parametric
(e.g. kernel) functions of time. Another extension is the family of seasonal
unit root processes (Hylleberg et al., 1990), which allow for deterministic
or drifting seasonal variation. Both stochastic and seasonal unit roots are
in the style of structural models in that they are based on time-varying
parameters.


Example 3.5 Unit roots in the Japanese equity premium 
An interesting issue concerning unit roots in the capital asset pricing model 
is investigated in Markellos and Mills (2001). Most econometric research on 
the CAPM is based on the ‘excess market’ regression 


$$r_t - f_t = \alpha + \beta(m_t - f_t) + u_t$$


where $r_t$ is the return on an asset at time $t$, $f_t$ is the return on a
proxy for the risk-free interest rate and $m_t$ is the return on a proxy for
the market portfolio. The variables $r_t - f_t$ and $m_t - f_t$ are typically
called excess asset returns and excess market returns (or equity premia),
respectively.

Estimation using standard techniques requires that the two excess returns
are stationary so that the CAPM regressions are ‘balanced’ (Banerjee et al.,
1993, pp. 164-8). While stationarity may be a stylized fact of asset and
market returns, no such claim can be made for the risk-free rate. Markellos
and Mills (2001) point out that, if $r_t$ and $m_t$ are assumed to be I(0)
processes, and $f_t$ is I(1), then $r_t - f_t$ and $m_t - f_t$ will
asymptotically form I(1) processes. Although the excess market regression will
not be unbalanced, the slope, the so-called ‘beta coefficient’, will measure
the relationship only between the dominant non-stationary parts on either side
of the regression. Since these variables have an identical basis, the I(1)
$f_t$ process, it would be expected that the beta from such a regression would
converge to unity. Markellos and Mills provide simulation evidence showing that
the correct null of a unit root in excess returns may not be accepted by
standard Dickey-Fuller tests and that it is also possible to produce upward
biases in betas. The size of the unit root tests on excess returns was found to
improve somewhat as the variance of the underlying risk-free rates increased
compared to that of the market returns. Conversely, biases in beta estimates
became more severe as this variance increased. Empirical results supporting
these arguments using UK and US data have been provided by Markellos and Mills
(2001) and Yoon (2005). Yoon finds that a stochastic unit root process offers a
better description of the proxy he used for the risk-free rate. Simulation and
empirical results show that the implications of the existence of a stochastic
unit root are qualitatively similar to those of a standard unit root.

Figure 3.8 Nikkei 225 index prices and seven-year Japanese government bond yields (end of year 1914-2003)

Figure 3.8 shows the logarithms of the Nikkei 225 index and seven-year
Japanese government bond yields. Both series appear to contain
a unit root, which was confirmed by a range of unit root tests. As shown in
figure 3.9, however, the Japanese equity premium, the difference between
the I(0) Nikkei returns and the I(1) risk-free rate, appears to be stationary.
(An ADF test for the equity premium $m_t - f_t$ produces a test statistic of
$-8.09$, which obviously strongly rejects the null of a unit root.) As argued
previously, this is a result of the difference in the variances between the two
series, with the Nikkei index returns having a standard deviation of 24.4,
nearly fourteen times that of the risk-free rate of 1.8.


4 Univariate linear stochastic models: further topics


4.1 Decomposing time series: unobserved component models and signal extraction


4.1.1 Unobserved component models

If a time series is difference stationary then it can be decomposed into a
stochastic non-stationary, or trend, component and a stationary, or noise,
component, i.e.


$$x_t = z_t + u_t \qquad (4.1)$$


Such a decomposition can be performed in various ways. For instance, Muth’s
(1960) classic example assumes that the trend component $z_t$ is a random walk


$$z_t = \mu + z_{t-1} + v_t$$


while $u_t$ is white noise and independent of $v_t$, i.e.
$u_t \sim WN(0, \sigma_u^2)$ and $v_t \sim WN(0, \sigma_v^2)$, with
$E(u_t v_{t-i}) = 0$ for all $i$. It thus follows that $\Delta x_t$ is a
stationary process

$$\Delta x_t = \mu + v_t + u_t - u_{t-1} \qquad (4.2)$$

and has an ACF that cuts off at lag one with coefficient


$$\rho_1 = -\frac{\sigma_u^2}{\sigma_v^2 + 2\sigma_u^2} \qquad (4.3)$$


It is clear that $-0.5 < \rho_1 < 0$, the exact value depending on the relative
sizes of the two variances, and that $\Delta x_t$ is an MA(1) process:


$$\Delta x_t = \mu + e_t - \theta e_{t-1} \qquad (4.4)$$

where $e_t \sim WN(0, \sigma_e^2)$. On defining $\kappa = \sigma_v^2/\sigma_u^2$
to be the signal-to-noise variance ratio, the relationship between the
parameters of (4.2) and (4.4) can be shown to be


$$\theta = \tfrac{1}{2}\left((\kappa + 2) - (\kappa^2 + 4\kappa)^{1/2}\right), \qquad \kappa = (1 - \theta)^2/\theta, \qquad \kappa > 0, \quad |\theta| < 1$$


Thus $\kappa = 0$ corresponds to $\theta = 1$, so that the unit roots in (4.4)
‘cancel out’ and the overdifferenced $x_t$ is stationary, while
$\kappa = \infty$ corresponds to $\theta = 0$, in which case $x_t$ is a pure
random walk. A test of the stationarity null of $\theta = 1$ has been set out in
section 3.4. It can therefore also be regarded as a test of the null
$\sigma_v^2 = 0$, for if this is the case then $z_t$ is a deterministic linear
trend.
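
The mapping between the signal-to-noise ratio and the MA(1) parameter is easy to verify numerically. The sketch below is illustrative only; the function names are hypothetical. It checks that the value of theta implied by a given kappa reproduces the lag-one autocorrelation in (4.3), which equals $-1/(\kappa + 2)$ once numerator and denominator are divided by $\sigma_u^2$.

```python
import math


def theta_from_kappa(kappa):
    """Invertible MA(1) parameter implied by the signal-to-noise ratio kappa."""
    return 0.5 * ((kappa + 2.0) - math.sqrt(kappa * kappa + 4.0 * kappa))


def kappa_from_theta(theta):
    """Signal-to-noise ratio implied by an MA(1) parameter 0 < theta <= 1."""
    return (1.0 - theta) ** 2 / theta


def rho1_from_kappa(kappa):
    """Lag-one autocorrelation (4.3), written in terms of kappa."""
    return -1.0 / (kappa + 2.0)


kappa = 1.0
theta = theta_from_kappa(kappa)
ma1_rho1 = -theta / (1.0 + theta * theta)   # lag-one autocorrelation of (4.4)
print(theta, ma1_rho1, rho1_from_kappa(kappa))
```

The smaller root of the quadratic is taken so that the MA(1) representation is invertible; the larger root is its reciprocal and gives the same autocorrelation.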

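Models of the form (4.1) are known as unobserved component (UC)
models, a more general formulation for the components being


$$\Delta z_t = \mu + \gamma(B)v_t \qquad (4.5)$$

and

$$u_t = \lambda(B)a_t$$


where $v_t$ and $a_t$ are independent white-noise sequences with finite
variances $\sigma_v^2$ and $\sigma_a^2$ and where $\gamma(B)$ and $\lambda(B)$
are stationary polynomials having no common roots. It can be shown that $x_t$
will then have the form


$$\Delta x_t = \mu + \theta(B)e_t \qquad (4.6)$$

where $\theta(B)$ and $\sigma_e^2$ can be obtained from


$$\sigma_e^2\frac{\theta(B)\theta(B^{-1})}{(1 - B)(1 - B^{-1})} = \sigma_v^2\frac{\gamma(B)\gamma(B^{-1})}{(1 - B)(1 - B^{-1})} + \sigma_a^2\lambda(B)\lambda(B^{-1}) \qquad (4.7)$$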


From this we see that it is not necessarily the case that the parameters of
the components can be identified from knowledge of the parameters of (4.6)
alone; indeed, in general the components will not be identified. If $z_t$ is
restricted to be a random walk ($\gamma(B) = 1$), however, the parameters of
the UC model will be identified. This is clearly the case for Muth’s (1960)
model, since $\sigma_u^2$ can be estimated by the lag one autocovariance of
$\Delta x_t$ (the numerator of (4.3)) and $\sigma_v^2$ can be estimated from the
variance of $\Delta x_t$ (the denominator of (4.3)) and the estimated value of
$\sigma_u^2$.

This example illustrates that, even though the variances are identified,
such a decomposition may not always be feasible, however, for it is unable to
account for positive first-order autocorrelation in $\Delta x_t$. To do so
requires relaxing either the assumption that $z_t$ is a random walk, so that
the trend component contains both permanent and transitory movements, or the
assumption that $v_t$ and $a_t$ are independent. If either of these assumptions
is relaxed, the parameters of the Muth model will not be identified.

As a second example, consider Poterba and Summers’ (1988) model for
measuring mean reversion in stock prices. Rather than assuming the noise
component to be purely random, they allow it to follow an AR(1) process


$$u_t = \lambda u_{t-1} + a_t$$

so that

$$\Delta x_t = \mu + v_t + (1 - \lambda B)^{-1}(1 - B)a_t$$

or


$$\Delta x_t^* = (1 - \lambda)\mu + (1 - \lambda B)v_t + (1 - B)a_t$$


where $x_t^* = (1 - \lambda B)x_t$. $\Delta x_t$ thus follows the ARMA(1,1) process

$$(1 - \lambda B)\Delta x_t = \theta_0 + (1 - \theta_1 B)e_t$$


where $e_t \sim WN(0, \sigma_e^2)$ and $\theta_0 = \mu(1 - \lambda)$. The
formula (4.7) can be used to obtain


$$\theta_1 = \frac{2 + (1 + \lambda^2)\kappa - (1 - \lambda)\left((1 + \lambda)^2\kappa^2 + 4\kappa\right)^{1/2}}{2(1 + \lambda\kappa)}$$

and

$$\sigma_e^2 = (\lambda\sigma_v^2 + \sigma_a^2)/\theta_1$$


which, of course, reduce to the Muth formulae when $\lambda = 0$.
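
The expressions for the ARMA(1,1) moving average parameter and innovation variance can be checked against the autocovariances of $(1 - \lambda B)v_t + (1 - B)a_t$, whose variance is $(1 + \lambda^2)\sigma_v^2 + 2\sigma_a^2$ and whose lag-one autocovariance is $-(\lambda\sigma_v^2 + \sigma_a^2)$. The sketch below is illustrative only: the function name is hypothetical, and kappa is taken to be $\sigma_v^2/\sigma_a^2$, which is an assumption about the notation.

```python
import math


def theta1_poterba_summers(lam, kappa):
    """MA parameter obtained by matching autocovariances (kappa = var_v / var_a)."""
    root = math.sqrt((1.0 + lam) ** 2 * kappa ** 2 + 4.0 * kappa)
    numerator = 2.0 + (1.0 + lam ** 2) * kappa - (1.0 - lam) * root
    return numerator / (2.0 * (1.0 + lam * kappa))


lam, var_v, var_a = 0.5, 2.0, 1.0
theta1 = theta1_poterba_summers(lam, var_v / var_a)
var_e = (lam * var_v + var_a) / theta1

# Autocovariances of the MA(1) representation (1 - theta1 B) e_t ...
ma_variance = (1.0 + theta1 ** 2) * var_e
ma_cov1 = -theta1 * var_e
# ... should equal those of (1 - lam B) v_t + (1 - B) a_t.
print(ma_variance, (1.0 + lam ** 2) * var_v + 2.0 * var_a)
print(ma_cov1, -(lam * var_v + var_a))
```

Setting `lam = 0` gives the same value as the Muth formula for theta, which is a quick way to confirm that the two sets of expressions are consistent.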

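The assumption that the trend component, $z_t$, follows a random walk
is not as restrictive as it may first seem. Consider the Wold decomposition
for $\Delta x_t$


$$\Delta x_t = \mu + \psi(B)e_t = \mu + \sum_{j=0}^{\infty}\psi_j e_{t-j} \qquad (4.8)$$

Since $\psi(1) = \sum\psi_j$ is a constant, we may write

$$\psi(B) = \psi(1) + C(B)$$

so that

$$\begin{aligned}
C(B) &= \psi(B) - \psi(1) \\
&= 1 + \psi_1 B + \psi_2 B^2 + \psi_3 B^3 + \cdots - (1 + \psi_1 + \psi_2 + \psi_3 + \cdots) \\
&= -\psi_1(1 - B) - \psi_2(1 - B^2) - \psi_3(1 - B^3) - \cdots \\
&= (1 - B)\left(-\psi_1 - \psi_2(1 + B) - \psi_3(1 + B + B^2) - \cdots\right)
\end{aligned}$$

i.e.

$$C(B) = (1 - B)\left(-\Big(\sum_{j=1}^{\infty}\psi_j\Big) - \Big(\sum_{j=2}^{\infty}\psi_j\Big)B - \Big(\sum_{j=3}^{\infty}\psi_j\Big)B^2 - \cdots\right) = \Delta\tilde{\psi}(B)$$

Thus


$$\psi(B) = \psi(1) + \Delta\tilde{\psi}(B)$$

implying that

$$\Delta x_t = \mu + \psi(1)e_t + \Delta\tilde{\psi}(B)e_t$$


This gives the decomposition due to Beveridge and Nelson (1981), with
components


$$\Delta z_t = \mu + \Big(\sum_{j=0}^{\infty}\psi_j\Big)e_t = \mu + \psi(1)e_t \qquad (4.9)$$

and

$$u_t = -\Big(\sum_{j=1}^{\infty}\psi_j\Big)e_t - \Big(\sum_{j=2}^{\infty}\psi_j\Big)e_{t-1} - \Big(\sum_{j=3}^{\infty}\psi_j\Big)e_{t-2} - \cdots = \tilde{\psi}(B)e_t$$


Since $e_t$ is white noise, the trend component is therefore a random walk with
a rate of drift equal to $\mu$ and an innovation equal to $\psi(1)e_t$, which is
thus proportional to that of the original series. The noise component is
clearly stationary, but, since it is driven by the same innovation as the trend
component, $z_t$ and $u_t$ must be perfectly correlated, in direct contrast to the Muth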
decomposition, which assumes that they are independent. For example, the
Beveridge-Nelson decomposition of the IMA(1,1) (integrated moving average)
process (4.4) is


$$\Delta z_t = \mu + (1 - \theta)e_t \qquad (4.10)$$
$$u_t = \theta e_t \qquad (4.11)$$

The relationship between the Beveridge-Nelson and Muth decompositions is exact.
Rather than assuming that $u_t$ and $v_t$ are independent, suppose that
$v_t = \alpha u_t$. Equating (4.2) and (4.4) then yields


$$\Delta x_t = \mu + (1 + \alpha)u_t - u_{t-1} = \mu + e_t - \theta e_{t-1}$$


so that $e_t = (1 + \alpha)u_t$ and $\theta e_t = u_t$, thus recovering (4.11)
and implying that $\theta = 1/(1 + \alpha)$. The trend (4.10) then becomes

$$\Delta z_t = \mu + (1 - \theta)e_t = \mu + \frac{(1 - \theta)}{\theta}u_t = \mu + \alpha u_t = \mu + v_t$$

which recovers the Muth trend.

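Following Cuddington and Winters (1987), Miller (1988) and Newbold
(1990), a simple way of estimating the Beveridge-Nelson components is to
approximate the Wold decomposition (4.8) by an ARIMA($p$,1,$q$) process


$$\Delta x_t = \mu + \frac{(1 - \theta_1 B - \cdots - \theta_q B^q)}{(1 - \phi_1 B - \cdots - \phi_p B^p)}e_t = \mu + \frac{\theta(B)}{\phi(B)}e_t \qquad (4.12)$$

so that

$$\Delta z_t = \mu + \psi(1)e_t = \mu + \frac{\theta(1)}{\phi(1)}e_t = \mu + \frac{(1 - \theta_1 - \cdots - \theta_q)}{(1 - \phi_1 - \cdots - \phi_p)}e_t \qquad (4.13)$$

where $\psi(1) = \theta(1)/\phi(1)$. Equation (4.12) can also be written as

$$\frac{\phi(B)}{\theta(B)}\psi(1)\Delta x_t = \mu + \psi(1)e_t \qquad (4.14)$$


and comparing (4.13) and (4.14) shows that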


$$z_t = \frac{\phi(B)}{\theta(B)}\psi(1)x_t = \omega(B)x_t$$

The trend is therefore a weighted average of current and past values of the
observed series, with the weights summing to unity since $\omega(1) = 1$. The
noise component is then given by


$$u_t = x_t - \omega(B)x_t = (1 - \omega(B))x_t = \tilde{\omega}(B)x_t = \frac{\phi(1)\theta(B) - \theta(1)\phi(B)}{\phi(1)\theta(B)}x_t$$


Since $\tilde{\omega}(1) = 1 - \omega(1) = 0$, the weights for the noise
component sum to zero. Using (4.12), this component can also be expressed as


$$u_t = \frac{\phi(1)\theta(B) - \theta(1)\phi(B)}{\phi(1)\phi(B)\Delta}e_t \qquad (4.15)$$


Since $u_t$ is stationary, the numerator of (4.15) can be written as
$\phi(1)\theta(B) - \theta(1)\phi(B) = \Delta\varphi(B)$, since it must contain
a unit root to cancel out the one in the denominator. As the order of the
numerator is $\max(p, q)$, $\varphi(B)$ must be of order $r = \max(p, q) - 1$,
implying that the noise has the ARMA($p$,$r$) representation


$$\phi(B)u_t = (\varphi(B)/\phi(1))e_t$$
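For example, for the IMA(1,1) process (4.4), the components are

$$z_t = (1 - \theta B)^{-1}(1 - \theta)x_t = (1 + \theta B + \theta^2 B^2 + \cdots)(1 - \theta)x_t = (1 - \theta)\sum_{j=0}^{\infty}\theta^j x_{t-j}$$

and

$$u_t = \frac{(1 - \theta B) - (1 - \theta)}{(1 - \theta B)}x_t = \frac{\theta(1 - B)}{(1 - \theta B)}x_t = \theta(1 - \theta B)^{-1}\Delta x_t$$

Thus, the trend can be recursively estimated as

$$\hat{z}_t = \theta\hat{z}_{t-1} + (1 - \theta)x_t, \qquad \hat{u}_t = x_t - \hat{z}_t$$

with starting values $\hat{z}_1 = x_1$ and $\hat{u}_1 = 0$.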
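
The recursion for the IMA(1,1) case is a form of exponential smoothing and takes only a few lines to implement. The following is a minimal sketch, assuming theta is already known (or has been estimated elsewhere); the function name `bn_ima11` is illustrative.

```python
def bn_ima11(x, theta):
    """Beveridge-Nelson trend and noise for an IMA(1,1) series with known theta.

    Uses z_t = theta * z_{t-1} + (1 - theta) * x_t, started at z_1 = x_1,
    and u_t = x_t - z_t, so that u_1 = 0.
    """
    trend = [x[0]]
    for value in x[1:]:
        trend.append(theta * trend[-1] + (1.0 - theta) * value)
    noise = [xi - zi for xi, zi in zip(x, trend)]
    return trend, noise


trend, noise = bn_ima11([1.0, 2.0, 3.0], theta=0.5)
print(trend)   # [1.0, 1.5, 2.25]
print(noise)   # [0.0, 0.5, 0.75]
```

The noise series also satisfies $\hat{u}_t = \theta\hat{u}_{t-1} + \theta\Delta x_t$, in line with the expression for $u_t$ above, and with `theta = 0` the trend coincides with the observed series.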

In a more general context, it is possible for an $x_t$ with Wold decomposition
(4.8) to be written as (4.1), with $z_t$ being a random walk and $u_t$ being
stationary and where the innovations of the two components are correlated
to an arbitrary degree. Only the Beveridge-Nelson decomposition is
guaranteed to exist, however.

Example 4.1 Beveridge-Nelson decomposition of UK equities
In example 2.6 the following ARIMA(2,1,2) model was fitted to the UK FTA
All Share index


$$(1 + 0.953B + 0.756B^2)\Delta x_t = 1.81 + (1 + 1.062B + 0.760B^2)e_t$$


Thus,


$$\psi(1) = \frac{(1 + 1.062 + 0.760)}{(1 + 0.953 + 0.756)} = 1.041$$


and the Beveridge—Nelson trend is therefore 


$$\Delta z_t = 1.81 + 1.041e_t$$


or, equivalently,


$$z_t = -1.062z_{t-1} - 0.760z_{t-2} + 1.041x_t + 0.993x_{t-1} + 0.787x_{t-2}$$


Since 


$$\phi(1)\theta(B) - \theta(1)\phi(B) = -0.114 + 0.188B - 0.074B^2 = \Delta(-0.114 + 0.074B)$$


the noise component is thus the ARMA(2,1) process


$$u_t = -0.953u_{t-1} - 0.756u_{t-2} - 0.042e_t + 0.027e_{t-1}$$
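
The arithmetic in this example can be retraced from the fitted coefficients. The sketch below is illustrative only: it works with the rounded coefficients printed above, so the results agree with the reported values only up to rounding, and the helper name `divide_by_one_minus_b` is hypothetical.

```python
def divide_by_one_minus_b(poly):
    """Divide a polynomial in B (coefficients in ascending powers) by (1 - B).

    Returns (quotient, remainder); the quotient coefficients are running sums.
    """
    quotient, running = [], 0.0
    for coefficient in poly[:-1]:
        running += coefficient
        quotient.append(running)
    return quotient, running + poly[-1]


phi = [1.0, 0.953, 0.756]     # AR polynomial of the fitted ARIMA(2,1,2) model
theta = [1.0, 1.062, 0.760]   # MA polynomial of the fitted model

phi_1, theta_1 = sum(phi), sum(theta)
psi_1 = theta_1 / phi_1                               # long-run multiplier
trend_weights = [psi_1 * c for c in phi]              # weights on x_t, x_{t-1}, x_{t-2}
numerator = [phi_1 * t - theta_1 * p for t, p in zip(theta, phi)]
quotient, remainder = divide_by_one_minus_b(numerator)
noise_ma = [q / phi_1 for q in quotient]              # MA part of the noise process

print(round(psi_1, 3), [round(w, 3) for w in trend_weights])
print([round(c, 3) for c in numerator], remainder)
print([round(c, 3) for c in noise_ma])
```

A remainder of (numerically) zero confirms that the numerator polynomial contains the unit root needed to cancel the difference operator in the denominator of (4.15).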


4.1.2 Signal extraction 


Given a UC model of the form (4.1) and models for $z_t$ and $u_t$, it is often
useful to provide estimates of these two unobserved components: this is known
as signal extraction. An MMSE estimate of $z_t$ is an estimate $\hat{z}_t$ that
minimises $E(\zeta_t^2)$, where $\zeta_t$ is the estimation error
$z_t - \hat{z}_t$. From, for example, Pierce (1979), given the infinite sample
$\{x_t\}_{-\infty}^{\infty}$, such an estimator is


$$\hat{z}_t = \nu_z(B)x_t = \sum_{j=-\infty}^{\infty}\nu_{zj}x_{t-j}$$

where the filter $\nu_z(B)$ is defined as


$$\nu_z(B) = \frac{\sigma_v^2\gamma(B)\gamma(B^{-1})}{\sigma_e^2\theta(B)\theta(B^{-1})}$$

in which case the noise component can be estimated as


$$\hat{u}_t = x_t - \hat{z}_t = [1 - \nu_z(B)]x_t = \nu_u(B)x_t$$


For example, for the Muth model of a random walk overlaid with sta- 
tionary noise, 


o? h —]1\—l1 o? 1 o0 ye 
a a ‘(1 — 6B") 20 Glee 


so that, using o? = (1 — 6)’o2, obtained using (4.6), we have 


r (1 S 0)? oo i 
R a 


j==œ% 


Thus, for values of 6 close to unity, 2, will be given by a very long moving 
average of future and past values of x. If 8 is close to zero, however, 2, will be 
almost equal to the most recently observed value of x. From (4.3) it is clear 
that large values of 0 correspond to small values of the signal-to-noise ratio 
k = 0? / o?; when the noise component dominates, a long moving average of 
x values provides the best estimate of trend, while, if the noise component is 
only small, the trend is given by the current position of x. 
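
The dependence of the two-sided Muth filter on $\theta$ can be checked numerically. The sketch below is illustrative only: it truncates the infinite sum at a finite number of leads and lags, and the function name and the choice $\theta = 0.8$ are assumptions made for the example.

```python
import numpy as np

def muth_two_sided_weights(theta, n_lags):
    """Weights on x_{t-j}, j = -n_lags, ..., n_lags, of the two-sided filter
    (1 - theta)^2 / (1 - theta^2) * theta^|j| (truncated version)."""
    j = np.arange(-n_lags, n_lags + 1)
    scale = (1.0 - theta) ** 2 / (1.0 - theta ** 2)
    return j, scale * theta ** np.abs(j)

j, w = muth_two_sided_weights(theta=0.8, n_lags=200)
central_weight = w[j == 0][0]    # weight placed on the current observation
total_weight = w.sum()           # approaches one as n_lags grows
```

The weights sum to one, so the filter is a proper weighted average; as $\theta$ approaches unity the central weight shrinks and the average spreads over more leads and lags.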
The estimation error, $\zeta_t = z_t - \hat{z}_t$, can be written as

$$\zeta_t = \nu_u(B)z_t - \nu_z(B)u_t$$

and Pierce (1979) shows that $\zeta_t$ will be stationary if $z_t$ and $u_t$ are generated by processes of the form (4.4). In fact, $\zeta_t$ will follow the process

$$\zeta_t = \theta_\zeta(B)\xi_t$$

where

$$\theta_\zeta(B) = \frac{\gamma(B)\lambda(B)}{\theta(B)}, \qquad \sigma_\xi^2 = \frac{\sigma_a^2\sigma_v^2}{\sigma_e^2}$$

and $\xi_t \sim WN(0, \sigma_\xi^2)$.
For the Muth model we therefore have it that $\zeta_t$ follows the AR(1) process

$$(1 - \theta B)\zeta_t = \xi_t$$

and the mean square error (MSE) of the optimal signal extraction procedure is the variance of this AR(1) process,

$$E(\zeta_t^2) = \frac{\sigma_\xi^2}{1-\theta^2} = \frac{\sigma_a^2\sigma_v^2}{\sigma_e^2(1-\theta^2)}$$

As noted earlier, if we are given only $\{x_t\}$ and its model, i.e. (4.6), then models for $z_t$ and $u_t$ are in general unidentified. If $x_t$ follows the IMA(1,1) process

$$(1 - B)x_t = (1 - \theta B)e_t \qquad (4.16)$$

then the most general signal-plus-white-noise UC model has $z_t$ given by

$$(1 - B)z_t = (1 - \Theta B)v_t \qquad (4.17)$$

and for any $\Theta$ value in the interval $-1 \le \Theta \le \theta$ there exist values of $\sigma_v^2$ and $\sigma_u^2$ such that $z_t + u_t$ yields (4.16). It can be shown that setting $\Theta = -1$ minimises the variance of both $z_t$ and $u_t$, and this is known as the canonical decomposition of $x_t$. Choosing this value implies that $\gamma(B) = 1 + B$, and we thus have

$$\hat{z}_t = \frac{\sigma_v^2(1+B)(1+B^{-1})}{\sigma_e^2(1-\theta B)(1-\theta B^{-1})}\,x_t$$

and

$$(1 - \theta B)\zeta_t = (1 + B)\xi_t$$


In this development we have assumed that, in estimating $z_t$, the future of $\{x_t\}$ is available as well as its past. In many situations it is necessary to estimate $z_t$ given only data on $x_s$ up to $s = t - m$, for finite $m$. This includes the problems of signal extraction based either on current data ($m = 0$) or on recent data ($m < 0$), and the problem of forecasting the signal ($m > 0$). We therefore need to extend the analysis to consider signal extraction given only the semi-infinite sample $\{x_s, s \le t - m\}$. Pierce (1979) shows that, in this case, an estimate of $z_t$ is given by

$$\hat{z}_t^{(m)} = \nu_z^{(m)}(B)x_t$$

where

$$\nu_z^{(m)}(B) = \frac{(1-B)}{\sigma_e^2\theta(B)}\left[\frac{\sigma_v^2\gamma(B)\gamma(B^{-1})}{(1-B)\theta(B^{-1})}\right]_m$$

in which we use the notation

$$[h(B)]_m = \sum_{j=m}^{\infty}h_jB^j$$

Thus, for the Muth model we have

$$\nu_z^{(m)}(B) = \frac{\sigma_v^2(1-B)}{\sigma_e^2(1-\theta B)}\left[\frac{(1-B)^{-1}}{(1-\theta B^{-1})}\right]_m$$

and Pierce (1979) shows that this becomes, for $m \ge 0$,

$$\nu_z^{(m)}(B) = \frac{\sigma_v^2B^m}{\sigma_e^2(1-\theta)}\sum_{j=0}^{\infty}(\theta B)^j = (1-\theta)B^m\sum_{j=0}^{\infty}(\theta B)^j$$

while, for $m < 0$,

$$\nu_z^{(m)}(B) = \theta^{-m}(1-\theta)B^m\sum_{j=0}^{\infty}(\theta B)^j + \frac{(1-\theta)^2}{(1-\theta B)}\sum_{j=0}^{-m-1}\theta^jB^{-j}$$

Therefore, when either estimating $z_t$ for the current time period ($m = 0$) or forecasting $z_t$ ($m > 0$), we apply an exponentially weighted moving average to the observed series, beginning with the most recent data available, but not otherwise depending on the value of $m$. For $m < 0$, when we are estimating $z_t$ based on some, but not all, of the relevant future observations of $x_t$, the filter comprises two parts: the same filter as in the $m \ge 0$ case applied to the furthest forward observation but with a declining weight ($\theta^{-m}$) placed upon it, and a second term capturing the additional influence of the observed future observations.
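
A numerical sketch of the one-sided Muth filter weights may help fix ideas. It rests on two assumptions that are not part of the text: the infinite sums are truncated at `n_terms` terms, and for $m < 0$ the second term is scaled by $(1-\theta)^2$, the scaling under which the weights sum to one and the filter approaches the two-sided filter as $m \to -\infty$. The function name is hypothetical.

```python
import numpy as np

def muth_one_sided_weights(theta, m, n_terms=400):
    """Weights on B^p (i.e. on x_{t-p}) for p = m, m+1, ..., m+n_terms-1.

    m >= 0: exponentially weighted moving average starting at x_{t-m}.
    m <  0: damped EWMA from the furthest forward observation plus a
            second term for the remaining observed future values
            (scaled by (1 - theta)^2, an assumption of this sketch).
    """
    powers = np.arange(m, m + n_terms)
    ewma = (1.0 - theta) * theta ** np.arange(n_terms)   # (1 - theta) * sum (theta B)^j
    if m >= 0:
        return powers, ewma
    n = -m
    weights = theta ** n * ewma          # first term, starting at power m = -n
    for j in range(n):                   # second term: theta^j B^{-j} / (1 - theta B)
        start = n - j                    # index of power -j within `powers`
        length = n_terms - start
        weights[start:] += (1.0 - theta) ** 2 * theta ** j * theta ** np.arange(length)
    return powers, weights

p_now, w_now = muth_one_sided_weights(theta=0.5, m=0)     # current-period estimate
p_fut, w_fut = muth_one_sided_weights(theta=0.5, m=-1)    # one future observation used
```

For $m = -1$ and $\theta = 0.5$ the weight on $x_{t+1}$ is 0.25 and the weight on $x_t$ is 0.375, with the remainder spread geometrically over the past.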

UC models can also be analysed within a state space framework, in which 
the Kalman filter plays a key role in providing both optimal forecasts and a 
method of estimating the unknown model parameters. In this framework, 
models such as the random-walk-plus-white-noise are known as structural 
models, and a thorough discussion of the methodological and technical ideas 
underlying such formulations is contained in Harvey (1989) and Harvey and 
Shephard (1992), while Koopman, Shephard and Doornik (1999) and 
Koopman et al. (2006) provide computer software. 

The UC model (4.5) is also related to the Hodrick-Prescott (H-P) trend filter (Hodrick and Prescott, 1997), which is a very popular method of detrending economic time series. This filter is derived by minimising the variation in the noise component, $u_t = x_t - z_t$, subject to a condition on the 'smoothness' of the trend component $z_t$. This smoothness condition penalises acceleration in the trend, so the minimisation problem becomes that of minimising the function

$$\sum_{t=1}^{T}u_t^2 + \delta\sum_{t=1}^{T}\left((z_{t+1} - z_t) - (z_t - z_{t-1})\right)^2$$

with respect to $z_t$, $t = 0, 1, \ldots, T+1$, where $\delta$ is a Lagrangean multiplier that can be interpreted as a smoothness parameter. The higher the value of $\delta$ the smoother the trend, so that in the limit, as $\delta \to \infty$, $z_t$ becomes a linear trend. The first-order conditions are

$$0 = -2(x_t - z_t) + 2\delta\left((z_t - z_{t-1}) - (z_{t-1} - z_{t-2})\right) - 4\delta\left((z_{t+1} - z_t) - (z_t - z_{t-1})\right) + 2\delta\left((z_{t+2} - z_{t+1}) - (z_{t+1} - z_t)\right)$$

which may be written as

$$x_t = z_t + \delta(1-B)^2(z_t - 2z_{t+1} + z_{t+2}) = \left(1 + \delta(1-B)^2(1-B^{-1})^2\right)z_t$$

so that the H-P trend estimator is

$$\hat{z}_t(\delta) = \left(1 + \delta(1-B)^2(1-B^{-1})^2\right)^{-1}x_t$$


The MMSE trend estimator can be written using (4.7) as

$$\hat{z}_t = \frac{\sigma_v^2\gamma(B)\gamma(B^{-1})}{\sigma_e^2\theta(B)\theta(B^{-1})}\,x_t = \frac{\gamma(B)\gamma(B^{-1})}{\gamma(B)\gamma(B^{-1}) + (\sigma_a^2/\sigma_v^2)(1-B)(1-B^{-1})\lambda(B)\lambda(B^{-1})}\,x_t$$

Comparing this with the H-P trend estimator shows that, for the latter to be optimal in the MMSE sense, we must set

$$\gamma(B) = (1-B)^{-1}, \qquad \lambda(B) = 1, \qquad \delta = \sigma_a^2/\sigma_v^2$$

In other words, the underlying UC model must have the trend component $\Delta^2z_t = v_t$ and $u_t$ must be white noise. Related filters are the band-pass of Baxter and King (1999) and Christiano and Fitzgerald (2003), and the Butterworth, suggested by Pollock (2000): see, for example, Mills (2003, chap. 4) for details.
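
In a finite sample the H-P minimisation problem has a closed-form solution that can be sketched as a linear system. The code below is an illustration under stated assumptions, not the infinite-sample filter derived above: the penalty is applied only to the $T-2$ second differences that can be formed within the sample, so the end-point treatment differs, and the function name `hp_trend` is hypothetical.

```python
import numpy as np

def hp_trend(x, delta):
    """Finite-sample H-P trend: minimise
    sum (x_t - z_t)^2 + delta * sum (z_{t+1} - 2 z_t + z_{t-1})^2
    by solving (I + delta * D'D) z = x, with D the second-difference matrix."""
    x = np.asarray(x, dtype=float)
    T = len(x)
    D = np.zeros((T - 2, T))
    for i in range(T - 2):
        D[i, i:i + 3] = [1.0, -2.0, 1.0]     # one second difference per row
    A = np.eye(T) + delta * D.T @ D
    return np.linalg.solve(A, x)

t = np.arange(20, dtype=float)
linear = 2.0 + 0.5 * t
trend_of_linear = hp_trend(linear, delta=1600.0)   # a linear series has zero acceleration
```

A linear series is returned unchanged for any $\delta$, because its second differences are zero, and for very large $\delta$ the trend of a noisy series approaches its least squares line, in keeping with the limiting behaviour described above.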


Example 4.2 Estimating expected real rates of interest
A long-standing example of the unobserved random walk buried in white noise is provided by the analysis of expected real rates of interest under the assumption of rational expectations, or, equivalently, financial market efficiency: see, for example, Fama (1975), Nelson and Schwert (1977) and Mills and Stephenson (1986). In this model, the unobservable expected real rate, $z_t$, is assumed to follow a driftless random walk, i.e. equation (4.10) with $\mu = 0$, and it differs from the observed real rate, $x_t$, by the amount of unexpected inflation, $u_t$, which, under the assumption of market efficiency, will be a white-noise process. The observed real rate will thus follow the ARIMA(0,1,1) process shown in (4.9).

Such a model fitted to the real UK Treasury bill rate over the period 1952Q1 to 2005Q3 yielded

$$\Delta x_t = (1 - 0.810B)e_t, \qquad \hat{\sigma}_e^2 = 15.61$$

From the relationships linking $\sigma_v^2$ and $\sigma_u^2$ to $\theta$ and $\sigma_e^2$, it follows that the unobserved variances may be estimated as

$$\hat{\sigma}_v^2 = (1 - 0.810)^2\hat{\sigma}_e^2 = 0.56$$

$$\hat{\sigma}_u^2 = 0.810\,\hat{\sigma}_e^2 = 12.64$$

yielding a signal-to-noise variance ratio of $\kappa = \hat{\sigma}_v^2/\hat{\sigma}_u^2 = 0.04$, so that variations in the expected real rate are small compared to variations in unexpected inflation. Expected real rates based on information up to and including time $t$, i.e. $m = 0$, can then be estimated using the exponentially weighted moving average

$$\hat{z}_t = 0.81\hat{z}_{t-1} + 0.19x_t, \qquad \hat{z}_1 = x_1$$

Unexpected inflation can then be obtained as $\hat{u}_t = x_t - \hat{z}_t$. Figure 4.1 provides plots of $x_t$, $\hat{z}_t$ and $\hat{u}_t$, showing that the expected real rate is considerably smoother than the observed real rate, as was suggested by the small signal-to-noise ratio. In the early part of the 1950s expected real rates were generally negative, but from 1956 to 1970 they were consistently positive. From the middle of 1970 and for the subsequent decade the expected real rate was always negative, reaching a minimum in 1975Q1 after inflation peaked in the previous quarter as a consequence of the Organization of Petroleum Exporting Countries (OPEC) price rise, and a local minimum in 1979Q2, this being a result of the value added tax (VAT) increase in the budget of that year. From mid-1980 the series is again positive and remains so until the end of the sample period. Fluctuations in unexpected inflation are fairly homogeneous except for the period from 1974 to 1982.

Figure 4.1 Real UK Treasury bill rate decomposition (quarterly January 1952-September 2005)
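
The variance calculations of example 4.2 are simple enough to reproduce directly. The snippet below only restates the arithmetic reported in the example; the variable names are chosen for illustration.

```python
theta = 0.810          # estimated moving average parameter
sigma2_e = 15.61       # estimated innovation variance of the ARIMA(0,1,1) model

sigma2_v = (1.0 - theta) ** 2 * sigma2_e   # variance of the random walk innovation
sigma2_u = theta * sigma2_e                # variance of the white-noise component
kappa = sigma2_v / sigma2_u                # signal-to-noise variance ratio
```

Rounded to two decimal places these reproduce the values 0.56, 12.64 and 0.04 quoted in the example.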


4.2 Measures of persistence and trend reversion 


4.2.1 Alternative measures of persistence 


Consider again the Wold representation

$$\Delta x_t = \mu + \psi(B)a_t = \mu + \sum_{j=0}^{\infty}\psi_ja_{t-j} \qquad (4.18)$$

From (4.18), the impact of a shock in period $t$, $a_t$, on the change in $x$ in period $t+k$, $\Delta x_{t+k}$, is $\psi_k$. The impact of the shock on the level of $x$ in period $t+k$, $x_{t+k}$, is therefore $1 + \psi_1 + \cdots + \psi_k$. The ultimate impact of the shock on the level of $x$ is the infinite sum of these moving average coefficients, defined as

$$\psi(1) = 1 + \psi_1 + \psi_2 + \cdots = \sum_{j=0}^{\infty}\psi_j$$

The value of $\psi(1)$, which is used in the Beveridge-Nelson decomposition above, can be taken as a measure of how persistent shocks to $x$ are. For example, $\psi(1) = 0$ for any trend stationary series, since $\psi(B)$ must contain a factor $(1 - B)$, whereas $\psi(1) = 1$ for a random walk, since $\psi_j = 0$ for $j > 0$. Other positive values of $\psi(1)$ are, of course, possible for more general difference stationary processes, depending upon the size and signs of the $\psi_j$.

Difficulties can arise in estimating w(1) because it is an infinite sum, thus 
requiring the estimation of an infinite number of coefficients. Various 
measures have thus been proposed in the literature to circumvent this 
problem, two of the most popular being the impulse response function, 
implicit in the Beveridge-Nelson approach and also proposed by Campbell 
and Mankiw (1987), and the variance ratio of Cochrane (1988). 

The impulse response measure of $\psi(1)$ is based on approximating $\psi(B)$ by a ratio of finite-order polynomials. This is possible because, since it is assumed that $\Delta x_t$ is a linear stationary process, it follows that it has an ARMA($p,q$) representation

$$\phi(B)\Delta x_t = \theta_0 + \theta(B)a_t$$

Equation (4.18) is then interpreted as the moving average representation, or impulse response function, of $\Delta x_t$:

$$\Delta x_t = \phi(1)^{-1}\theta_0 + \phi(B)^{-1}\theta(B)a_t$$

From the equality $\psi(B) = \phi(B)^{-1}\theta(B)$, the measure $\psi(1)$ can then be calculated directly as $\psi(1) = \theta(1)/\phi(1)$.
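
The link between the impulse responses and $\psi(1)$ can be illustrated numerically. The sketch below expands $\psi(B) = \theta(B)/\phi(B)$ by the usual recursion and compares the cumulated weights with $\theta(1)/\phi(1)$; the function name `ma_weights` and the truncation point are assumptions of the example, while the coefficients are those of the ARIMA(2,1,2) model quoted in examples 4.1 and 4.3.

```python
import numpy as np

def ma_weights(phi, theta, n):
    """Coefficients psi_0, ..., psi_n of psi(B) = theta(B) / phi(B).

    phi and theta hold polynomial coefficients in B, lowest order first,
    with phi[0] = theta[0] = 1. Uses phi(B) psi(B) = theta(B), i.e.
    psi_j = theta_j - sum_{i>=1} phi_i psi_{j-i}.
    """
    psi = np.zeros(n + 1)
    for j in range(n + 1):
        acc = theta[j] if j < len(theta) else 0.0
        for i in range(1, min(j, len(phi) - 1) + 1):
            acc -= phi[i] * psi[j - i]
        psi[j] = acc
    return psi

phi = [1.0, 0.953, 0.756]       # (1 + 0.953B + 0.756B^2)
theta = [1.0, 1.062, 0.760]     # (1 + 1.062B + 0.760B^2)
psi = ma_weights(phi, theta, 2000)
cumulative = np.cumsum(psi)             # impact on the level of x after k periods
psi_one = sum(theta) / sum(phi)         # theta(1) / phi(1)
```

The cumulated impulse responses settle down to $\theta(1)/\phi(1)$, which is approximately 1.04 for these coefficients.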

Cochrane (1988), on the other hand, proposes a non-parametric measure of persistence known as the variance ratio, defined as $V_k = \sigma_k^2/\sigma_1^2$, where

$$\sigma_k^2 = k^{-1}V(x_t - x_{t-k}) = k^{-1}V(\Delta_kx_t)$$

$\Delta_k = 1 - B^k$ being the $k$th differencing operator. This measure is based on the following argument. If $x_t$ is a pure random walk with drift, so that $\Delta x_t = \theta + a_t$, then the variance of its $k$th differences will grow linearly with $k$; using the fact that $\Delta_k = \Delta(1 + B + \cdots + B^{k-1})$,

$$V(\Delta_kx_t) = V\left((x_t - x_{t-1}) + (x_{t-1} - x_{t-2}) + \cdots + (x_{t-k+1} - x_{t-k})\right) = \sum_{j=1}^{k}V(x_{t-j+1} - x_{t-j}) = \sum_{j=1}^{k}V(a_t) = k\sigma^2$$

If, on the other hand, $x_t$ is trend stationary, the variance of its $k$th differences approaches a constant, this being twice the unconditional variance of the series: if $x_t = \beta_0 + \beta_1t + a_t$, $V(\Delta_kx_t) = V(a_t) + V(a_{t-k}) = 2\sigma^2$.

Cochrane thus suggests plotting a sample estimate of $\sigma_k^2$ as a function of $k$. If $x_t$ is a random walk the plot should be constant at $\sigma^2$, whereas if $x_t$ is trend stationary the plot should decline towards zero. If fluctuations in $x_t$ are partly permanent and partly temporary, so that the series can be modelled as a combination of random walk and stationary components, the plot of $\sigma_k^2$ versus $k$ should settle down to the variance of the innovation to the random walk component.

In providing a sample estimate of $\sigma_k^2$, Cochrane corrects for two possible sources of small sample bias. First, the sample mean of $\Delta x_t$ is used to estimate the drift term $\mu$ at all $k$, rather than a different trend term at each $k$ being estimated from the mean of the $k$-differences. Second, a degrees of freedom correction $T/(T-k-1)$ is included, for, without this, $\sigma_k^2$ will decline towards zero as $k \to T$ because, in the limit, a variance cannot be taken with just one observation. These corrections produce an estimator of $\sigma_k^2$ that is unbiased when applied to a pure random walk with drift. The actual formula used to compute the estimator from the sample $\{x_t\}_0^T$ is (Cochrane, 1988, equation (A3), p. 917)

$$\hat{\sigma}_k^2 = \frac{T}{k(T-k)(T-k+1)}\sum_{j=k}^{T}\left(x_j - x_{j-k} - \frac{k}{T}(x_T - x_0)\right)^2$$

From Cochrane (1988), the asymptotic standard error of $\hat{\sigma}_k^2$ is $(4k/3T)^{0.5}\hat{\sigma}_k^2$. The variance ratio can then be estimated as $\hat{V}_k = \hat{\sigma}_k^2/\hat{\sigma}_1^2$.
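
Cochrane's estimator translates almost line by line into code. The following is a minimal sketch: the function names are hypothetical, the input is assumed to hold the $T+1$ observations $x_0, \ldots, x_T$, and the short series is made up so that the arithmetic can be checked by hand.

```python
import numpy as np

def cochrane_sigma2_k(x, k):
    """Bias-corrected estimate of sigma_k^2 from x_0, ..., x_T."""
    x = np.asarray(x, dtype=float)
    T = len(x) - 1
    drift = (x[-1] - x[0]) / T                   # sample mean of the first differences
    k_diffs = x[k:] - x[:-k]                     # x_j - x_{j-k}, j = k, ..., T
    scale = T / (k * (T - k) * (T - k + 1))
    return scale * np.sum((k_diffs - k * drift) ** 2)

def variance_ratio(x, k):
    """Estimated variance ratio V_k = sigma_k^2 / sigma_1^2."""
    return cochrane_sigma2_k(x, k) / cochrane_sigma2_k(x, 1)

def sigma2_k_std_error(x, k):
    """Asymptotic standard error (4k / 3T)^0.5 * sigma_k^2."""
    T = len(x) - 1
    return np.sqrt(4.0 * k / (3.0 * T)) * cochrane_sigma2_k(x, k)

x = np.array([0.0, 1.0, 3.0, 2.0, 4.0])     # hypothetical series with T = 4
s1 = cochrane_sigma2_k(x, 1)
s2 = cochrane_sigma2_k(x, 2)
v2 = variance_ratio(x, 2)
```

For this toy series the estimates are $\hat{\sigma}_1^2 = 2$ and $\hat{\sigma}_2^2 = 1$, so that $\hat{V}_2 = 0.5$; in practice $T$ would of course be far larger.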


Cochrane shows that $V_k$ can also be written as

$$V_k = 1 + 2\sum_{j=1}^{k-1}\frac{k-j}{k}\rho_j$$

so that the limiting variance ratio, $V$, can be defined as

$$V \equiv \lim_{k\to\infty}V_k = 1 + 2\sum_{j=1}^{\infty}\rho_j$$

Furthermore, since it can also be shown that

$$\lim_{k\to\infty}\sigma_k^2 = |\psi(1)|^2\sigma^2$$

$V$ can also be written as

$$V = (\sigma^2/\sigma_1^2)|\psi(1)|^2$$

which provides the link between the two persistence measures. By defining $R^2 = 1 - (\sigma^2/\sigma_1^2)$, the fraction of the variance that is predictable from knowledge of the past history of $\Delta x_t$, we have

$$\psi(1) = \sqrt{\frac{V}{1 - R^2}}$$

so that $\psi(1) \ge \sqrt{V}$: the more predictable is $\Delta x_t$ the greater the difference between the two measures.

4.2.2 Testing for trend reversion


Whether a random walk is present in a financial time series has been shown 
to be a question of some importance. The various unit root tests discussed in 
chapter 3 are, of course, one approach to testing whether a series contains a 
random walk component. As we have seen, however, such tests can have 
difficulties in detecting some important departures from a random walk, and 
the associated distributions of the test statistics tend to have awkward 
dependencies on nuisance parameters. 

When the null hypothesis under examination is that the series is generated by a random walk with strict white-noise normal increments, a test based on the variance ratio may be preferred. Consider again the observed series $\{x_t\}_0^T$ and suppose that $x_t$ is generated by the random walk

$$x_t = \theta + x_{t-1} + a_t$$

where $a_t \sim NID(0, \sigma^2)$. For this model, the variance ratio, $V$, is unity. Lo and MacKinlay (1988, 1989) consider the test statistic

$$M(k) = \hat{\sigma}_k^2/\hat{\sigma}_1^2 - 1 = \hat{V}_k - 1$$

and show that

$$z_1(k) = M(k)\cdot\left(\frac{2(2k-1)(k-1)}{3Tk}\right)^{-1/2} \overset{a}{\sim} N(0,1)$$

They also derive a version of the variance ratio test that is robust to serial correlation and heteroskedasticity. If $a_t$ satisfies the conditions (3.11), the test statistic becomes

$$z_2(k) = M(k)\cdot\Omega^{-1/2}(k)$$

where

$$\Omega(k) = \sum_{j=1}^{k-1}\left(\frac{2(k-j)}{k}\right)^2\hat{\delta}_j$$

$$\hat{\delta}_j = \frac{\sum_{t=j+1}^{T}\alpha_{0t}\alpha_{jt}}{\left(\sum_{t=1}^{T}\alpha_{0t}\right)^2}$$

$$\alpha_{jt} = \left(x_{t-j} - x_{t-j-1} - \frac{1}{T}(x_T - x_0)\right)^2$$

The $\hat{\delta}_j$ are heteroskedastic-consistent estimators of the asymptotic variances of the estimated autocorrelations of $\Delta x_t$.

Lo and MacKinlay (1989) find that this large-sample normal approximation 
works well when k is small and T is large. They emphasise, however, 
that it can become unsatisfactory for large k because the empirical distri- 
bution of M(k) is highly skewed in these circumstances. Although the 
empirical sizes of the test statistic are close to their nominal values, almost all 
the rejections occur in the upper tail of the distribution. It is therefore clear 
that the normal approximation to the distribution of M(k) is likely to be of 
only limited practical use. As a consequence, the empirical distributions of 
the test statistics need to be evaluated by simulation. 
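
The statistics $M(k)$, $z_1(k)$ and $z_2(k)$ defined above can be computed with a short routine. This is a sketch under stated assumptions: the function names are hypothetical, the input holds $x_0, \ldots, x_T$, the lag must satisfy $k \ge 2$, and Cochrane's bias-corrected estimator is used for both $\hat{\sigma}_k^2$ and $\hat{\sigma}_1^2$. It returns the statistics only and says nothing about which reference distribution is appropriate.

```python
import numpy as np

def sigma2_k(x, k):
    """Cochrane's bias-corrected estimate of sigma_k^2; x holds x_0, ..., x_T."""
    T = len(x) - 1
    drift = (x[-1] - x[0]) / T
    k_diffs = x[k:] - x[:-k]
    return T / (k * (T - k) * (T - k + 1)) * np.sum((k_diffs - k * drift) ** 2)

def variance_ratio_tests(x, k):
    """Return M(k), z1(k) and the robust z2(k) for an integer lag k >= 2."""
    x = np.asarray(x, dtype=float)
    T = len(x) - 1
    M = sigma2_k(x, k) / sigma2_k(x, 1) - 1.0
    z1 = M / np.sqrt(2.0 * (2 * k - 1) * (k - 1) / (3.0 * T * k))
    # alpha_{0t}: squared demeaned first differences, t = 1, ..., T
    alpha0 = (np.diff(x) - (x[-1] - x[0]) / T) ** 2
    omega = 0.0
    for j in range(1, k):
        # alpha_{jt} is alpha_{0,t-j}, so the cross products pair t with t - j
        delta_j = np.sum(alpha0[j:] * alpha0[:-j]) / np.sum(alpha0) ** 2
        omega += (2.0 * (k - j) / k) ** 2 * delta_j
    z2 = M / np.sqrt(omega)
    return M, z1, z2

# Hypothetical five-observation series, used only to check the arithmetic
M, z1, z2 = variance_ratio_tests([0.0, 1.0, 3.0, 2.0, 4.0], k=2)
```

Given the skewness of $M(k)$ for large $k$, the statistics would in practice be compared with simulated rather than normal critical values.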

In any case, the asymptotic normality of $M(k)$ relies on fixing $k$ and allowing $T$ to increase, so that $k/T \to 0$. Richardson and Stock (1989) consider a different perspective, however, in which $k$ is allowed to tend asymptotically to a non-zero fraction ($\delta$) of $T$, i.e. $k/T \to \delta$. Under this asymptotic theory, $M(k)$ has a limiting distribution that is not normal but has a representation in terms of functionals of Brownian motion, $W(r)$, which under the null does not depend on any unknown parameters:

$$M(k) \Rightarrow \frac{1}{\delta}\int_{\delta}^{1}[Y(r)]^2\,dr$$

where

$$Y(r) = W(r) - W(r - \delta) - \delta W(1)$$

Richardson and Stock (1989) argue that the $k/T \to \delta$ theory provides a much better approximation to the finite sample distribution of $M(k)$ than does the fixed $k$ theory. Moreover, this limiting distribution is valid even under non-normality and certain forms of heteroskedasticity. Lo and MacKinlay (1989) find that the power of the variance ratio test is comparable to that of $\tau_\tau$ when $x_t$ is trend stationary. Deo and Richardson (2003), however, point out some difficulties with the asymptotic consistency and power of the test, which suggest that $k$ should not be set too large when $T$ is small.

A typical use of the variance ratio test is to calculate the statistic for various 
values of k and reject the null of a random walk if any of the statistics are 
significant at a pre-assigned significance level (see, for example, Liu and He, 
1991). This gives rise to the problem of multiple comparisons among test 
statistics, as it is inappropriate to focus on the significance of individual test 
statistics without controlling for the size of the implicit joint test. Chow and 
Denning (1993) and Fong, Koh and Ouliaris (1997) thus propose joint 
variance ratio tests. Chow and Denning propose a joint test based on the 
maximum absolute value of the variance ratio statistics, which is compared 
to the studentised maximum modulus distribution, for which tables of 
critical values are available. Unfortunately, this test assumes that the vector of 
variance ratio test statistics is multivariate normal, but, as we have seen, this 
is unlikely to be the case when k/T is large. In this situation, there will also be 
a high degree of ‘overlap’ in the data induced by taking k-differences, which 
will in turn induce dependencies between the individual statistics. 

Fong, Koh and Ouliaris (1997) use the results of Richardson and Smith (1991) and Richardson (1993) on modelling serial correlation induced by overlapping data to derive the following Wald statistic to examine the joint hypothesis that an $m \times 1$ vector of variance ratio test statistics

$$M_m = (M(m_1), M(m_2), \ldots, M(m_m))'$$

is equal to zero:

$$T \cdot M_m'\Phi^{-1}M_m \sim \chi^2(m)$$

where $\Phi$ is the covariance matrix of $M_m$. For any pair of lags $m_j = r$ and $m_k = s$, $1 \le j, k \le m$, $\Phi$ has the elements

$$\Phi_{jj} = \frac{2(2r-1)(r-1)}{3r}, \qquad \Phi_{kk} = \frac{2(2s-1)(s-1)}{3s}$$

$$\Phi_{jk} = \Phi_{kj} = \frac{2((3s-r-1)(r-1))}{3s}$$

Simulation results presented by Fong, Koh and Ouliaris suggest that the size and power of this statistic are satisfactory.
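
The joint statistic can be assembled directly from the elements of $\Phi$. The sketch below assumes that, in the off-diagonal formula, $r$ denotes the smaller and $s$ the larger of the two lags (which reproduces the diagonal elements when $r = s$); the function names and the numerical inputs are hypothetical.

```python
import numpy as np

def fko_covariance(lags):
    """Covariance matrix Phi of the variance ratio statistics at the given lags."""
    m = len(lags)
    Phi = np.empty((m, m))
    for a in range(m):
        for b in range(m):
            r = min(lags[a], lags[b])     # smaller lag (assumption of this sketch)
            s = max(lags[a], lags[b])     # larger lag
            Phi[a, b] = 2.0 * (3 * s - r - 1) * (r - 1) / (3.0 * s)
    return Phi

def joint_wald(M, lags, T):
    """Wald statistic T * M' Phi^{-1} M, to be compared with chi-squared(m)."""
    M = np.asarray(M, dtype=float)
    return T * M @ np.linalg.solve(fko_covariance(lags), M)

Phi = fko_covariance([2, 4])
wald = joint_wald([0.1, 0.2], lags=[2, 4], T=100)   # made-up M(k) values
```

For lags 2 and 4 the matrix is $\Phi = [[1, 1.5], [1.5, 3.5]]$, whose diagonal elements coincide with the variances used in $z_1(k)$ up to the factor $1/T$.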


Example 4.3 Persistence and mean reversion in UK stock prices
In example 2.6 we fitted an ARIMA(2,1,2) process to the logarithms of the FTA All Share index, with $\phi(B) = (1 + 0.953B + 0.756B^2)$ and $\theta(B) = (1 + 1.062B + 0.760B^2)$. Thus, $\psi(1) = 2.822/2.709 = 1.041$, which provides some evidence in favour of mean aversion, whereby a series will continue to diverge from its previously forecast value following a shock. The null hypothesis $\psi(1) = 1$ can be parameterised as $\phi_1 + \phi_2 - \theta_1 - \theta_2 = 0$, however, and a Wald test of this linear restriction has a marginal significance level of 0.07, suggesting that such evidence is quite weak. Cochrane (1988) has criticised the use of fitted ARIMA models for constructing the long-run measure $\psi(1)$ because they are designed to capture short-run dynamics (recall their development for short-term forecasting by Box and Jenkins, 1976), rather than the long-run correlations that are of interest here.

Table 4.1 presents $M(k)$ statistics for a sequence of $k$ values associated with 'long-differences' of prices of between one and eight years. Also provided are the p-values using the normal approximation and simulated upper tail percentiles using the Richardson and Stock (1989) $k/T \to \delta$ asymptotic theory. Using either distribution, there is no evidence to reject the random walk null, although for $k \ge 36$ the $\hat{V}_k$ statistics are all less than unity, unlike the estimate of the variance ratio given by $\psi(1)$. In support of this conclusion, the joint test statistic has a value of 13.32. As it is distributed as $\chi^2_8$, this also cannot reject the random walk null, having a marginal significance level of 0.101.

Table 4.1 Variance ratio test statistics for UK stock prices (monthly 1965-2002)

k     M(k)    p(z_1)   5%      50%     95%
12    0.13    0.23    -0.29   -0.04    0.30
24    0.01    0.48    -0.41   -0.08    0.41
36   -0.06    0.57    -0.49   -0.11    0.46
48   -0.15    0.67    -0.55   -0.14    0.58
60   -0.11    0.61    -0.60   -0.18    0.61
72   -0.11    0.60    -0.65   -0.20    0.64
84   -0.18    0.65    -0.70   -0.25    0.64
96   -0.21    0.66    -0.72   -0.30    0.64

Note: p(z_1) denotes the probability under the null hypothesis of observing a larger variance ratio than that observed using the asymptotic N(0,1) distribution. 5%, 50%, 95% are percentiles of the empirical distributions of M(k) computed under the $k/T \to \delta$ asymptotic theory using NID(0,1) returns with 5000 replications for each k.


4.2.3 Mean reverting models in continuous time 


Several continuous processes have been developed for modelling mean reversion in interest rates (for a recent review, see Bali and Wu, 2006). Chan et al. (1992) show that many of the single-factor term structure models for the short-term interest rate $r$ can be nested in the differential equation

$$dr = (a + br)dt + \sigma r^{\gamma}dZ$$

where $Z$ is a Wiener process. These dynamics imply that the conditional mean and the variance of the process depend on the level of $r$ through the parameters $b$ and $\gamma$, with mean reversion being captured by $b < 0$. This stochastic differential equation nests a broad class of interest rate processes, with eight of the most popular specifications being obtained by placing appropriate parameter restrictions (for references and detailed descriptions of these models, see Chan et al., 1992).

Model 1 (Merton's model) is obtained if we assume that $b = \gamma = 0$ and corresponds to a simple Brownian motion with drift. By setting just $\gamma = 0$ we get model 2 (Vasicek's model), which is a mean-reverting process known as the Ornstein-Uhlenbeck. Obviously, the conditional volatility is constant in these first two specifications since $\gamma$ is equal to zero. Model 3 (Cox, Ingersoll and Ross' 'square root process'), which sets $\gamma = 0.5$, implies that conditional volatility is proportional to $r$. By letting $a = 0$ and $\gamma = 1$, model 5 is obtained, which corresponds to the geometric Brownian motion process, while the additional restriction $b = 0$ allows the simpler driftless specification in model 4 (Dothan's model). Model 6 (Brennan and Schwartz's model) encompasses the two previous specifications and is obtained by just setting $\gamma = 1$. In these last three models the conditional volatility is proportional to $r^2$. Finally, model 7 (Cox, Ingersoll and Ross' variable rate process) and model 8 (constant elasticity of variance) are obtained by restricting $a = b = 0$ and $\gamma = 1.5$, and $a = 0$, respectively.

Model 1:  $dr = a\,dt + \sigma\,dZ$                    Merton                               $b = \gamma = 0$
Model 2:  $dr = (a + br)dt + \sigma\,dZ$               Vasicek                              $\gamma = 0$
Model 3:  $dr = (a + br)dt + \sigma\sqrt{r}\,dZ$       Cox, Ingersoll and Ross (CIR)        $\gamma = 0.5$
Model 4:  $dr = \sigma r\,dZ$                          Dothan                               $a = b = 0$, $\gamma = 1$
Model 5:  $dr = br\,dt + \sigma r\,dZ$                 Geometric Brownian motion            $a = 0$, $\gamma = 1$
Model 6:  $dr = (a + br)dt + \sigma r\,dZ$             Brennan and Schwartz                 $\gamma = 1$
Model 7:  $dr = \sigma r^{1.5}dZ$                      CIR variable rate                    $a = b = 0$, $\gamma = 1.5$
Model 8:  $dr = br\,dt + \sigma r^{\gamma}dZ$          Constant elasticity of variance      $a = 0$


Chan et al. (1992) base estimation of these models on the following discrete-time approximation of the continuous-time differential equations:

$$\Delta r_t = a + br_{t-1} + \varepsilon_t$$

where $\varepsilon_t$ is a disturbance term with moments $E(\varepsilon_t) = 0$ and $E(\varepsilon_t^2) = \sigma^2r_{t-1}^{2\gamma}$.

If interest rates are measured over a short time span then the approximation error in this model will be of second-order importance. Generalised method of moments (GMM) estimation is appropriate since it requires only that the distribution of interest rate changes is stationary and ergodic and that the relevant expectations exist. This is particularly useful here since the various models imply different distributions. For example, the first two imply that the distribution of interest rate changes is normal, while model 3 assumes that these changes are distributed as a random variable proportional to a non-central $\chi^2$. Finally, the GMM estimators have the additional advantage of being consistent even if the disturbances are conditionally heteroskedastic.
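
To see what the discrete-time approximation implies, it can be simulated directly. The sketch below is illustrative only: it uses normal shocks with standard deviation $\sigma r_{t-1}^{\gamma}\sqrt{dt}$, floors the rate at zero inside the power (a guard added here so that fractional powers stay defined), and all parameter values and names are hypothetical rather than estimates from the text.

```python
import numpy as np

def simulate_ckls(a, b, sigma, gamma, r0, n_steps, dt=1.0, seed=0):
    """Euler-type simulation of r_t = r_{t-1} + (a + b r_{t-1}) dt + eps_t,
    with eps_t ~ N(0, sigma^2 r_{t-1}^{2 gamma} dt)."""
    rng = np.random.default_rng(seed)
    r = np.empty(n_steps + 1)
    r[0] = r0
    for t in range(1, n_steps + 1):
        level = max(r[t - 1], 0.0)                    # guard against negative rates
        shock = sigma * level ** gamma * np.sqrt(dt) * rng.standard_normal()
        r[t] = r[t - 1] + (a + b * r[t - 1]) * dt + shock
    return r

# Hypothetical parameter values for a square-root (model 3) type process
path = simulate_ckls(a=0.004, b=-0.07, sigma=0.02, gamma=0.5, r0=0.05, n_steps=600)

# With sigma = 0 the path is deterministic and reverts towards -a/b when b < 0
no_noise = simulate_ckls(a=0.01, b=-0.2, sigma=0.0, gamma=0.5, r0=0.10, n_steps=2)
```

Setting `gamma` and the restrictions on `a` and `b` as in the list of models above gives simulated counterparts of each of the eight specifications.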

Estimation begins by determining the moment conditions. Let $\theta$ be the parameter vector with elements $a$, $b$, $\sigma^2$ and $\gamma$. Four moment conditions are sufficient to identify the system exactly. Two conditions follow from the mean and variance of the error term. Given that $\varepsilon_t = r_t - r_{t-1} - a - br_{t-1}$, the other two reflect orthogonality conditions:

$$f_t(\theta) = \begin{bmatrix} \varepsilon_t \\ \varepsilon_tr_{t-1} \\ \varepsilon_t^2 - \sigma^2r_{t-1}^{2\gamma} \\ (\varepsilon_t^2 - \sigma^2r_{t-1}^{2\gamma})r_{t-1} \end{bmatrix}$$

Under the null hypothesis that the restrictions implied by the model are satisfied, $E(f_t(\theta)) = 0$. The GMM estimation procedure uses the $T$ observations to replace $E(f_t(\theta)) = 0$ with its sample counterpart $g_T(\theta)$, given by

$$g_T(\theta) = \frac{1}{T}\sum_{t=1}^{T}f_t(\theta)$$

and then chooses parameter estimates that minimise the quadratic form

$$J_T(\theta) = g_T'(\theta)W_T(\theta)g_T(\theta)$$

where $W_T(\theta)$ is a positive-definite symmetric weighting matrix. The minimised value of $J_T(\theta)$ is distributed under the null hypothesis that the model is true as $\chi^2$ with degrees of freedom equal to the number of orthogonality conditions minus the number of estimated parameters: see Newey (1985) for further details on the use and asymptotic properties of this statistic for testing general model misspecification.

Hansen (1982) shows that choosing $W_T(\theta) = S^{-1}(\theta)$, where $S(\theta) = E[f_t(\theta)f_t'(\theta)]$, produces the estimator of $\theta$ with the smallest asymptotic covariance matrix, given by $T^{-1}(D_0'(\theta)S_0^{-1}(\theta)D_0(\theta))^{-1}$, where $D_0(\theta)$ is the Jacobian of $g_T(\theta)$ evaluated at the estimated parameters. This can be used to test the significance of individual parameter estimates. (Further discussion of GMM estimation within a regression framework may be found in chapter 8, section 1.4.)
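
The moment conditions and the quadratic form can be written down compactly. The following is a minimal sketch rather than a full GMM routine: the function names are hypothetical, the identity matrix is used as a default weighting matrix purely for illustration (Hansen's optimal choice would replace it), no minimisation is performed, and the short rate series is made up.

```python
import numpy as np

def ckls_moments(params, r):
    """T x 4 array whose rows are f_t(theta) for theta = (a, b, sigma2, gamma)."""
    a, b, sigma2, gamma = params
    r = np.asarray(r, dtype=float)
    lag = r[:-1]                                     # r_{t-1}
    eps = r[1:] - lag - a - b * lag                  # epsilon_t
    var_gap = eps ** 2 - sigma2 * lag ** (2.0 * gamma)
    return np.column_stack([eps, eps * lag, var_gap, var_gap * lag])

def gmm_objective(params, r, W=None):
    """Quadratic form g_T' W g_T; W defaults to the identity (an assumption)."""
    g = ckls_moments(params, r).mean(axis=0)         # sample moments g_T(theta)
    W = np.eye(4) if W is None else W
    return g @ W @ g

# Hypothetical rates; a and b are set to their least squares values, which
# make the first two sample moments exactly zero
r = np.array([0.05, 0.06, 0.055, 0.07, 0.065, 0.08])
X = np.column_stack([np.ones(len(r) - 1), r[:-1]])
a_hat, b_hat = np.linalg.lstsq(X, np.diff(r), rcond=None)[0]
g = ckls_moments((a_hat, b_hat, 0.01, 0.5), r).mean(axis=0)
J = gmm_objective((a_hat, b_hat, 0.01, 0.5), r)
```

In the unrestricted, exactly identified case all four sample moments can be set to zero; for the restricted models the objective would be minimised numerically over the free parameters and the minimised value used for the specification test.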

A number of other mean-reverting continuous-time processes have been developed for describing the dynamics of interest rates. For example, Aït-Sahalia (1996a) places restrictions on the drift coefficient of a diffusion process and proposes a non-parametric estimator for its diffusion coefficient. Empirical results in Aït-Sahalia (1996b) suggest that the linearity of the drift appears to be the main source of misspecification in several one-factor diffusion models of interest rates. He thus favours a non-linear mean reversion process, which is globally stationary despite local random walk behaviour, and uses non-parametric methods for estimating the drift and volatility.

Figure 4.2 Three-month US Treasury bills, secondary market rates (monthly April 1954-February 2005)


Example 4.4 Approximating mean reversion in continuous time for US interest rates
We estimate the various models discussed above by the GMM method using monthly data on the US three-month Treasury bill rate, shown in figure 4.2. The data span the period from April 1954 to February 2005, a total of 623 observations. The estimated parameters and J-statistics for each of the models are shown in table 4.2. In terms of these statistics, models 2 and 3 are likely to be misspecified, as the J-statistics have p-values of less than 5 per cent. Inspection of the parameter estimates and standard errors suggests that this is a consequence of insignificant mean reversion parameter estimates. On this criterion, models 4 and 7 offer the best fit to the data. The unrestricted model estimates suggest that, although mean reversion appears to be insignificant, conditional volatility is sensitive to the level of the process through the statistically significant estimate of $\gamma = 1.54$.

Table 4.2 Interest rate model parameter estimates

Specification   a          b          $\sigma^2$  $\gamma$   J
Unrestricted    0.0079    -0.1365     0.7818     1.5422
               (0.0060)   (0.1236)   (0.8898)   (0.2093)
Model 1         0.0008                0.0001                4.6943
               (0.0019)              (0.0000)              [0.0957]
Model 2         0.0039    -0.0664     0.0001                4.8159
               (0.0056)   (0.1174)   (0.0000)              [0.0282]
Model 3         0.0034    -0.0582     0.0021     0.5        4.6168
               (0.0056)   (0.1164)   (0.0004)              [0.0317]
Model 4                               0.0384     1          3.3644
                                     (0.0057)              [0.3388]
Model 5                    0.0151     0.0381     1          3.3142
                          (0.0402)   (0.0058)              [0.1907]
Model 6         0.0039    -0.0651     0.0379     1          3.0430
               (0.0055)   (0.1157)   (0.0058)              [0.0811]
Model 7                               0.6095     1.5        1.8888
                                     (0.0788)              [0.5958]
Model 8                    0.0178     0.4577     1.4475     1.6531
                          (0.0403)   (0.5778)   (0.2304)   [0.1986]

Note: Figures in parentheses denote asymptotic standard errors. In square brackets we give the p-values associated with each J-statistic, which is distributed under the null that the model is true as $\chi^2_k$. The degrees of freedom $k$ are equal to the number of orthogonality conditions minus the number of estimated parameters.


These results are comparable to those of Chan et al. (1992) and others, who also find no evidence of mean reversion and a comparable value for $\gamma$ of around 1.5. Insignificant mean reversion implies that interest rates are non-stationary. Recent research by Faff and Gray (2006) suggests that this insignificant mean reversion, which has been reported by several researchers, may be due to the poor finite sample properties of the GMM estimator. Moreover, as shown by Aït-Sahalia (1999), among others, non-linear specifications of the drift component may produce different results.


4.3 Fractional integration and long memory processes 


4.3.1 A broader definition of stationarity 


A great deal of theoretical and empirical research has been produced in 
recent years on long memory processes, which generalise the conventional 
unit root model of non-stationarity and allow a wide range of short- and 
long-run dynamic behaviour. Research in this area has brought into the open 
several concerns about the correct definition of stationarity and long-run 
behaviour. Although most researchers agree that ‘smoothness’ can be related 
to the long-run properties of a time series, there are no conclusive empirical 
results on what the best models to use are. The issue of comparing alternative 
models of long-run time series behaviour becomes even more elusive once 
non-linearities are considered. As noted by Granger (1997), it is possible that 
some of these controversies originate from the fact that the term ‘trend’ has 
not received satisfactory mathematical attention. Moreover, the treatment 
of trends within the standard unit root and trend stationary framework is 
not sufficiently general to model all the situations that may arise, although 
Phillips (2001, 2003, 2005b) has made important advances in this area in 
recent years. 

Motivated by these concerns, Granger and Hallman (1991) and Granger 
(1995), among others, have discussed a framework based on the notion of 
extended memory. Although the general concept of stationarity is, as we have 
consistently emphasised, central to the theory of modern time series analysis, 
it is almost impossible to test for directly and can be defined only in terms of 
individual properties, such as constancy of moments and extent of memory. 
Consider the conditional probability density function of $x_{t+h}$, given the
information set $I_t$: $x_{t-j}, q_{t-j}$, $j \geq 0$, where $q_t$ is a vector of other
explanatory variables. The series $x_t$ is said to be short memory in
distribution (SMD) with respect to $I_t$ if

$$|P(x_{t+h} \text{ in } A \mid I_t \text{ in } B) - P(x_{t+h} \text{ in } A)| \to 0$$

as $h \to \infty$ for all appropriate sets $A$ and $B$ such that $P(I_t \text{ in } B) > 0$,
where $P(S)$ denotes the probability of the event $S$ occurring. The complement
of an SMD process is long memory in distribution (LMD).

A more narrow definition of memory can be made with respect to the 
mean of a process. The conditional mean may be defined as 


$$E(x_{t+h} \mid I_t) = \xi_{t,h}$$

so that $\xi_{t,h}$ is the optimum least squares forecast of $x_{t+h}$ using $I_t$. Then
$x_t$ is said to be short memory in mean (SMM) if

$$\lim_{h \to \infty} \xi_{t,h} = \Theta_t$$

where $\Theta_t$ is a random variable with a distribution that does not depend on $I_t$.
The most interesting case is when this distribution is singular, so that $\Theta_t$
takes just the single value $\mu$, which is the unconditional mean of $x_t$, assumed
to be finite. If $\xi_{t,h}$ depends on $I_t$ for all $h$, then $x_t$ is said to be extended
memory in mean (EMM).

Using the above definition, a process is characterised as SMM if a con- 
ditional h-step forecast of the mean tends to a constant as h increases. This 
implies that any historical information that is available at time t should 
become progressively less relevant as the forecast horizon increases. SMMs 
are complemented by EMM processes, in which present information is 
generally useful in forecasting the mean irrespective of the forecast horizon. 
Essentially, SMM and EMM correspond to the conventional properties of 
mixing and non-mixing in the mean, respectively. The notion of extended 
memory can be used in defining non-linear forms of persistence: for example, 
an EMM process can be defined as ‘extended I(1)’ if it has increments that are 
SMM. Granger (1995) emphasises that only monotonic, non-decreasing 
functions of an EMM will remain EMM (e.g. a logarithmic, polynomial or 
logistic transformation of a random walk): for example, the cosine of a ran- 
dom walk will be SMM, while non-linear transformations of homoskedastic 
SMM processes will also be SMM. 
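
The distinction between SMM and EMM can be illustrated with two textbook cases. The sketch below is only an illustration under stated assumptions: it uses the standard closed-form conditional means of a stationary AR(1) process and of a driftless random walk, and the function names and numerical values are hypothetical choices, not taken from the text.

```python
import numpy as np

def ar1_conditional_mean(x_t, phi, mu, h):
    """E(x_{t+h} | I_t) for a stationary AR(1) with mean mu and |phi| < 1."""
    return mu + phi ** h * (x_t - mu)

def random_walk_conditional_mean(x_t, h):
    """E(x_{t+h} | I_t) for a driftless random walk: the last observed value."""
    return float(x_t)

# Hypothetical values chosen purely for illustration.
x_now, phi, mu = 5.0, 0.9, 1.0
horizons = [1, 10, 100, 1000]

ar1_path = np.array([ar1_conditional_mean(x_now, phi, mu, h) for h in horizons])
rw_path = np.array([random_walk_conditional_mean(x_now, h) for h in horizons])

for h, f_ar, f_rw in zip(horizons, ar1_path, rw_path):
    print(f"h={h:5d}  AR(1) forecast={f_ar:.6f}  random walk forecast={f_rw:.6f}")
```

The AR(1) forecast converges to the unconditional mean as the horizon grows, which is the SMM property, whereas the random walk forecast depends on the current observation at every horizon, which is the EMM property.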

Park and Phillips (1999) have developed some asymptotic theory for non- 
linear transformations of integrated stochastic processes. They demonstrate 
that this theory differs from that of integrated and stationary time series and 
that the convergence rates of sample functions depend on the type of non- 
linear transformation. Granger (1995) discusses the use of the attractor 
concept in generalising the notion of mean reversion and the definition 
of I(0) processes. More specifically, he uses the attractor to characterise 
non-linear processes that have a tendency to return to some set of values, 
rather than to a single value (e.g. the mean). This set could be a limit cycle or, 
in the case of a so-called ‘chaotic process’, a fractal set (or strange attractor). 

The following section reviews ARFIMA models, which constitute the most 
widely researched class of long memory models, although a variety of alternative 
models have been proposed in the literature that are not covered here (see, 
for example, Ding, Granger and Engle, 1993, and Granger and Ding, 1996). 


4.3.2 ARFIMA models 


Much of the analysis of financial time series considers the case when the
order of differencing, $d$, is either zero or one. If the latter, $x_t$ is I(1), an EMM
process, and its ACF declines linearly. If the former, $x_t$ is I(0) and its ACF
exhibits an exponential decay; observations separated by a long time span
may, therefore, be assumed to be independent, or at least nearly so. As we
have seen, I(1) behaviour in the levels of financial time series is an
implication of many models of efficient markets, and the previous sections of this
chapter have discussed the analysis of such behaviour in considerable detail. 
Nevertheless, many empirically observed time series, although appearing to 
satisfy the assumption of stationarity (perhaps after some differencing 
transformation), seem to exhibit a dependence between distant observations 
that, although small, is by no means negligible. 

Such series are particularly found in hydrology, where the ‘persistence’ of 
river flows is known as the Hurst effect (see, for example, Mandelbrot and
Wallis, 1969, and Hosking, 1984), but many financial time series also exhibit 
similar characteristics of extremely long persistence. This may be charac- 
terised as a tendency for large values to be followed by further large values of 
the same sign in such a way that the series seem to go through a succession of 
‘cycles’, including long cycles whose length is comparable to the total sample 
size. 

This viewpoint has been persuasively argued by Mandelbrot (1969, 1972) in 
extending his work on non-Gaussian (marginal) distributions in economics, 
particularly financial prices (see Mandelbrot, 1963b: this is discussed in 
chapter 7) to an exploration of the structure of serial dependence in economic 
time series. While Mandelbrot considered processes that were in the form of 
discrete-time ‘fractional Brownian motion’, attention has focused more 
recently on an extension of the ARIMA class to model long-term persistence. 

We have so far considered only integer values of $d$. If $d$ is non-integer, $x_t$ is
said to be fractionally integrated, and models for such values of $d$ are referred
to as ARFIMA (autoregressive fractionally integrated moving average). This
notion of fractional integration seems to have been proposed independently
by Hosking (1981) and Granger and Joyeux (1980), and Beran (1992), Baillie
(1996), Robinson (2003) and Velasco (2006) provide detailed surveys of such
models. To make the concept operational, we may use the binomial series
expansion for any real $d > -1$:

$$\Delta^d = (1 - B)^d = \sum_{k=0}^{\infty} \binom{d}{k} (-B)^k = 1 - dB + \frac{d(d-1)}{2!}B^2 - \frac{d(d-1)(d-2)}{3!}B^3 + \cdots \tag{4.20}$$

How does the ARFIMA model incorporate ‘long memory’ behaviour? Let
us first consider the ARFIMA(0,d,0) process


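$$(1 - B)^d x_t = (1 - \pi_1 B - \pi_2 B^2 - \cdots) x_t = a_t$$

where, using the gamma function $\Gamma(n) = (n - 1)!$, the coefficients are given by

$$\pi_j = -\frac{\Gamma(j - d)}{\Gamma(-d)\Gamma(j + 1)}$$

so that, for example, $\pi_1 = d$.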

This process can thus be interpreted as an infinite autoregression. It is often 
referred to as fractional white noise, and is the discrete-time analogue of frac- 
tional Brownian motion, just as the random walk is the discrete-time analogue 
of Brownian motion. 

The ARFIMA model nests the ARIMA model and is able to represent both 
short- and long-term time series behaviour as functions of a single innov- 
ation. By inverting the fractional difference operator, we obtain an MA 
representation 


$$x_t = (1 - B)^{-d} a_t = (1 + \psi_1 B + \psi_2 B^2 + \cdots) a_t$$

with

$$\psi_j = \frac{\Gamma(j + d)}{\Gamma(d)\Gamma(j + 1)}$$

For $d = 0$, $x_t$ is simply white noise and its ACF declines immediately to zero,
whereas, for $d = 1$, $x_t$ is a random walk and hence has an ACF that remains
(approximately) at unity. For non-integer values of $d$, it can be shown that
the ACF of $x_t$ declines hyperbolically to zero (see figure 4.3). To be precise, the
autocorrelations are given by

$$\rho_k = \frac{\Gamma(1 - d)\Gamma(k + d)}{\Gamma(d)\Gamma(k + 1 - d)} \approx \frac{\Gamma(1 - d)}{\Gamma(d)} k^{2d-1}$$

for large $k$, so that the autocorrelations exhibit a hyperbolic decay, the
speed of which depends upon $d$, and this property is also seen for the $\pi$- and
$\psi$-weights.
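
The gamma-function expressions for fractional white noise can be evaluated with simple recursions, which avoids overflow for large lags. The following is a minimal sketch; the function names are hypothetical, and the recursions follow from the ratio of successive gamma-function terms rather than from anything stated explicitly in the text. The first function returns the raw coefficients of $B^j$ in $(1 - B)^d$, so its entry for $j = 1$ is $-d$.

```python
import numpy as np
from math import gamma

def frac_diff_coefficients(d, n):
    """Coefficients of B^j in (1 - B)^d for j = 0, ..., n."""
    c = np.ones(n + 1)
    for j in range(1, n + 1):
        c[j] = c[j - 1] * (j - 1 - d) / j
    return c

def ma_coefficients(d, n):
    """Coefficients of B^j in (1 - B)^(-d), i.e. Gamma(j+d) / (Gamma(d) Gamma(j+1))."""
    return frac_diff_coefficients(-d, n)

def fractional_noise_acf(d, n):
    """Autocorrelations rho_k, k = 0, ..., n, of ARFIMA(0,d,0) with |d| < 0.5."""
    rho = np.ones(n + 1)
    for k in range(1, n + 1):
        rho[k] = rho[k - 1] * (k - 1 + d) / (k - d)
    return rho

d = 0.3  # hypothetical value used for illustration
rho = fractional_noise_acf(d, 2000)
hyperbolic = gamma(1 - d) / gamma(d) * 2000.0 ** (2 * d - 1)
print("rho_2000 exact vs hyperbolic approximation:", rho[2000], hyperbolic)
```

Multiplying the two coefficient sequences as polynomials in $B$ should return the identity, which provides a convenient check that the AR and MA weights are mutually consistent.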

The process is SMM for $0 < d < 1$, weakly stationary for $d < 0.5$ and
invertible for $d > -0.5$. For $d > 0.5$ the variance of $x_t$ is infinite, and so the
process is non-stationary, but Robinson (1994) refers to it as being ‘less
non-stationary’ than a unit root process, so smoothly bridging the gulf
between I(0) and I(1) processes. The autocorrelation properties of fractional
white noise are conveniently set out in Baillie (1996, table 4.2). Smooth
persistent trending behaviour and positive dependencies are obtained only
when $d$ is positive. Processes with negative orders of $d$ are characterised by a
non-smooth form of persistence, or antipersistence, which is associated with
negative short- and long-range dependencies.

Figure 4.3 ACFs of ARFIMA(0,d,0) processes with d = 0.5 and d = 0.75

These same properties are displayed by the more general ARFIMA(p,d,q)
process,

$$\phi(B)(1 - B)^d x_t = \theta(B) a_t$$

although parametric expressions for the $\pi$- and $\psi$-weights are particularly
complicated: again, see Baillie (1996). The impulse response function is
defined from

$$\Delta x_t = (1 - B)^{1-d} \phi(B)^{-1} \theta(B) a_t = A(B) a_t$$

From Baillie (1996), $(1 - B)^{1-d} = 0$ at $B = 1$ for $d < 1$, so that any ARFIMA
process is trend-reverting since $A(1) = 0$.

The intuition behind the concept of long memory and the limitation of
the integer-d restriction emerge more clearly in the frequency domain. The
series $x_t$ will display long memory if its spectral density, $f_x(\omega)$, increases
without limit as the frequency $\omega$ tends to zero,

$$\lim_{\omega \to 0} f_x(\omega) = \infty$$

If $x_t$ is ARFIMA then $f_x(\omega)$ behaves like $\omega^{-2d}$ as $\omega \to 0$, so that $d$ parameterises
its low-frequency behaviour. When $d = 1$, $f_x(\omega)$ thus behaves like $\omega^{-2}$ as
$\omega \to 0$, whereas when the integer-d restriction is relaxed a much richer range
of spectral behaviour near the origin becomes possible. Indeed, the ‘typical 
spectral shape’ of economic time series (Granger, 1966), which exhibits 
monotonically declining power as frequency increases (except at seasonals), 
is well captured by an I(d) process with 0 < d< 1. Moreover, although the 
levels of many series have spectra that appear to be infinite at the origin, and 
so might seem to warrant first differencing, after such differencing they often 
have no power at the origin. This suggests that first differencing takes out 
‘too much’ and that using a fractional d is therefore a more appropriate form 
of detrending. This difficulty is compounded by the finding that unit root 
tests have even lower power than usual against fractional alternatives. 
Although such tests are consistent when the alternative is a fractionally
integrated process, d = 1 will often be chosen rather than a correct d that is
less than unity: see Sowell (1990), Diebold and Rudebusch (1991), Hassler 
and Wolters (1994) and Lee and Schmidt (1996). A variety of testing pro- 
cedures have been developed to overcome this lack of power, and some of 
these are discussed in the following section. 

Fractionally integrated processes have the significant drawback of not 
being able to be modelled by finite-order difference equations. This is 
counter-intuitive in most situations in finance, where we would expect agent 
behaviour, as reflected in observed time series, to depend on the near past 
and to affect only the near future. Long-memory models, such as ARFIMA 
processes, involve an infinite memory and thus present a real challenge for 
theorists. Where ARFIMA models seem to make much sense is in terms of 
approximating aggregate agent or time series behaviour. For example, 
Granger (1980) notes that the summation of low-order ARMA processes will 
yield ARMA processes of increasing, and eventually infinite, order that can 
be well approximated using an ARFIMA model.


4.3.3 Testing for fractional differencing


A ‘classic’ approach to detecting the presence of long-term memory in a time
series (or long-range dependence, as it is also known) is to use the ‘range over
standard deviation’ or ‘rescaled range’ statistic. This was originally developed by
Hurst (1951) when studying river discharges and was proposed in an economic
context by Mandelbrot (1972). This ‘R/S’ statistic is the range of partial sums of
deviations of a time series from its mean, rescaled by its standard deviation, i.e.

$$R_0 = \hat{\sigma}_0^{-1} \left[ \max_{1 \leq i \leq T} \sum_{t=1}^{i} (x_t - \bar{x}) - \min_{1 \leq i \leq T} \sum_{t=1}^{i} (x_t - \bar{x}) \right] \tag{4.21}$$

where

$$\hat{\sigma}_0^2 = T^{-1} \sum_{t=1}^{T} (x_t - \bar{x})^2$$

The first term in brackets is the maximum of the partial sums of the first $i$
deviations of $x_t$ from the sample mean. Since the sum of all $T$ deviations of the
$x_t$s from their mean is zero, this maximum is always non-negative. The second
term is the minimum of the same sequence of partial sums, and hence is always
non-positive. The difference between the two quantities, called the ‘range’, for
obvious reasons, is therefore always non-negative: hence $R_0 \geq 0$.

Although it has long been established that the R/S statistic has the ability 
to detect long-range dependence, it is sensitive to short-range influences. 
Consequently, any incompatibility between the data and the predicted 
behaviour of the R/S statistic under the null hypothesis of no long-run 
dependence need not come from long-term memory, but may merely be a 
symptom of short-run autocorrelation. 

Lo (1991) therefore considers a modified R/S statistic in which short-run
dependence is incorporated into its denominator, which becomes (the square
root of) a consistent estimator of the variance of the partial sum in (4.21),

$$R_q = \hat{\sigma}_q^{-1} \left[ \max_{1 \leq i \leq T} \sum_{t=1}^{i} (x_t - \bar{x}) - \min_{1 \leq i \leq T} \sum_{t=1}^{i} (x_t - \bar{x}) \right]$$

where $\hat{\sigma}_q^2$ is of the form (3.12) and may be written as

$$\hat{\sigma}_q^2 = \hat{\sigma}_0^2 \left( 1 + 2 \sum_{j=1}^{q} w_{qj} r_j \right), \qquad w_{qj} = 1 - \frac{j}{q + 1}, \qquad q < T$$

the $r_j$, $j = 1, \ldots, q$, being the sample autocorrelations of $x_t$ as defined in
chapter 2, section 5.1. Lo provides the assumptions and technical details to
allow the asymptotic distribution of $R_q$ to be obtained. $T^{-1/2} R_q$ converges in
distribution to a well-defined random variable (the range of a Brownian bridge
on the unit interval), whose distribution and density functions are plotted and
significance levels reported in Lo (1991, fig. I, table II, respectively). The
statistics are consistent against a class of long-range dependent alternatives
that include all ARFIMA(p,d,q) models with $-0.5 < d < 0.5$.

The appropriate choice of q (i.e. how to distinguish between short- and 
long-range dependencies) remains an unresolved issue, however (see, for 
example, Pagan, 1996). Moreover, there is evidence that, if the distribution 
of $x_t$ is ‘fat-tailed’ (a phenomenon that is discussed in detail in chapter 7),
then the sampling distribution of $R_q$ is shifted to the left relative to the
asymptotic distribution. This would imply that rejection rates on the left tail 
(rejections in favour of d< 0: antipersistence) are above the nominal sizes 
given by the asymptotic distribution, whereas rejection rates on the right tail 
(rejections in favour of d>0: persistent long memory) are below the 
nominal size (see Hiemstra and Jones, 1997). Lo thus argues that the R/S 
approach may perhaps be best regarded as a kind of portmanteau test that 
may complement, and come prior to, a more comprehensive analysis of 
long-range dependence. 
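
The classical and modified R/S statistics discussed above can be computed in a few lines. This is a minimal sketch under stated assumptions: the function name `rescaled_range` and its arguments are hypothetical, the denominator uses Bartlett-weighted sample autocovariances with weights $1 - j/(q+1)$, and setting `q=0` gives the classical statistic.

```python
import numpy as np

def rescaled_range(x, q=0):
    """R/S statistic; q = 0 is the classical version, q > 0 the modified one."""
    x = np.asarray(x, dtype=float)
    T = x.size
    dev = x - x.mean()
    partial_sums = np.cumsum(dev)
    value_range = partial_sums.max() - partial_sums.min()
    # Variance estimate: sample variance plus weighted autocovariances.
    s2 = np.mean(dev ** 2)
    for j in range(1, q + 1):
        autocov_j = np.sum(dev[j:] * dev[:-j]) / T
        s2 += 2.0 * (1.0 - j / (q + 1.0)) * autocov_j
    return value_range / np.sqrt(s2)

# Illustration on simulated white noise (hypothetical data, seed chosen arbitrarily).
rng = np.random.default_rng(12345)
x = rng.standard_normal(2000)
v = rescaled_range(x, q=6) / np.sqrt(x.size)
print("T^(-1/2) R_q for white noise:", v)
```

The normalised value `v` is the quantity that would be compared with the tabulated interval for the range of a Brownian bridge; for a short-memory series it should usually fall inside that interval.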

An obvious approach to testing for fractional differencing is to construct 
tests against the null of either d= 1 or d=0. The ADF and non-parametric 
tests of d= 1 discussed in chapter 3, section 1, are consistent against frac- 
tional d alternatives (see Diebold and Rudebusch, 1991, and Hassler and 
Wolters, 1994), although the power of the tests grows more slowly as d 
diverges from unity than with the divergence of the AR parameter $\phi$ from
unity. Similarly, Lee and Schmidt (1996) show that the $\eta$ statistics of
Kwiatkowski et al. (1992) for testing the null of d= 0 are consistent against 
fractional d alternatives in the range — 0.5 < d < 0.5, and their power com- 
pares favourably to Lo’s modified R/S statistic. 

Alternatively, we may be able to construct tests based on the residuals
from fitting an ARIMA(p,0,q) model to $x_t$. Suppose the fitted model is
$\hat{\phi}(B) x_t = \hat{\theta}(B) \hat{a}_t$. Agiakloglou and Newbold (1994) derive an LM test of
$d = 0$ as the t-ratio on $\delta$ in the regression

$$\hat{a}_t = \sum_{i=1}^{p} \beta_i W_{t-i} + \sum_{j=1}^{q} \gamma_j Z_{t-j} + \delta K_t(m) + u_t \tag{4.22}$$

where

$$\hat{\theta}(B) W_t = x_t, \qquad \hat{\theta}(B) Z_t = \hat{a}_t$$

and

$$K_t(m) = \sum_{j=1}^{m} j^{-1} \hat{a}_{t-j}$$

Agiakloglou and Newbold also derive a test based on the residual
autocorrelations, $\hat{r}_1, \hat{r}_2, \ldots, \hat{r}_m$, but this requires a rather greater
computational effort. They find that, although both tests have empirical size close to
nominal size, low power is a particular problem when $p$ and $q$ are positive
rather than zero and when a non-zero mean of $x_t$ has to be estimated.
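
For the simplest case $p = q = 0$, the LM regression collapses to a regression of the series on a harmonically weighted sum of its own lags. The sketch below covers only that special case; the function names are hypothetical, no intercept is included, and the series is assumed to have zero mean already.

```python
import numpy as np

def harmonic_lag_sum(a, m):
    """K_t(m) = sum_{j=1}^{m} a_{t-j} / j for t = m, ..., T-1 (zero-based)."""
    a = np.asarray(a, dtype=float)
    lag_weights = np.concatenate(([0.0], 1.0 / np.arange(1, m + 1)))
    # Entry t of the full convolution is sum_{j=1}^{m} a[t-j] / j once t >= m.
    return np.convolve(a, lag_weights)[m:a.size]

def lm_tratio(a, m):
    """t-ratio on delta in the no-intercept regression of a_t on K_t(m)."""
    a = np.asarray(a, dtype=float)
    K = harmonic_lag_sum(a, m)
    y = a[m:]
    delta = (K @ y) / (K @ K)
    resid = y - delta * K
    s2 = (resid @ resid) / (y.size - 1)
    return delta / np.sqrt(s2 / (K @ K))

# Illustration on simulated white noise (hypothetical data).
rng = np.random.default_rng(2024)
white_noise = rng.standard_normal(2000)
print("LM t-ratio, m = 25:", lm_tratio(white_noise, 25))
```

With positive $p$ or $q$ the regressors $W_{t-i}$ and $Z_{t-j}$ would have to be added, which this sketch deliberately omits.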

Mean estimation for long-memory processes is a general problem, as the 
sample mean is a poor estimate of the true mean in these models (see 
Samarov and Taqqu, 1988). Indeed, Newbold and Agiakloglou (1993) also 
find that the SACF of fractional white noise (when d > 0) is a severely biased 
estimator of the true ACF, so that it will be very difficult to detect long- 
memory behaviour from the SACFs of moderate length series. 

Dolado, Gonzalo and Mayoral (2002) proposed a simple Wald-type test
for fractionally integrated processes that extends the Dickey-Fuller approach
to the more general case of testing $d = d_0$ against $d = d_1$ with $d_1 < d_0$. When
$d_1$ is not known a priori, they show that the choice of any $T^{1/2}$-consistent
estimator of $d_1 \in [0, 1)$ is sufficient to implement the test while retaining
asymptotic normality. The test, coined FD-F, has acceptable power and
provides an estimate of $d$ under the alternative hypothesis. The test has the
advantage of not assuming any known density for the errors and retains the
simplicity of the standard DF test, where an I(1) null hypothesis can be
evaluated against some composite alternative.

The test evaluates the null hypothesis that a series is $I(d_0)$ against the
alternative that it is $I(d_1)$, where $d_0$ and $d_1$ are real numbers. By analogy to the
DF procedure, the testing regression is

$$\Delta^{d_0} x_t = \phi \Delta^{d_1} x_{t-1} + a_t$$

where $a_t$ is an I(0) process. The procedure is based upon testing the statistical
significance of the coefficient $\phi$. As with the DF test, $\Delta^{d_0} x_t$ and $\Delta^{d_1} x_{t-1}$ have
been differenced according to their order of integration under the null and
alternative hypotheses, respectively. When $\phi = 0$ the series is fractional white
noise, implying that $x_t$ is $I(d_0)$ under the null. When $\phi < 0$, $x_t$ can be
expressed as

$$(\Delta^{d_0 - d_1} - \phi B) \Delta^{d_1} x_t = a_t$$

The polynomial $\pi(B) = (1 - B)^{d_0 - d_1} - \phi B$ has absolutely summable
coefficients with $\pi(0) = 1$ and $\pi(1) = -\phi \neq 0$. The non-explosivity condition of
the polynomial is $-2^{d_0 - d_1} < \phi < 0$. When $\phi < 0$, $\Delta^{d_1} x_t$ is I(0), implying that
$x_t$ is $I(d_1)$ under the alternative, whereas if $\phi = 0$ then $x_t$ is $I(d_0)$ under the
null. As with the conventional DF test, the statistic can either be formulated
as the normalised estimate of $\phi$ or its t-statistic.

When $d_0 = 1$ and $d_1 = 0$, we have the conventional unit root testing
framework of chapter 3 and the asymptotic distributions of the test statistics
correspond to those derived there. In general, the asymptotic distribution of
the test statistics depends on the distance between the null and the alternative
hypothesis and on the nature of the process under the null hypothesis. More
specifically, the FD-F test statistics will be standard (Gaussian) either if the
processes under both hypotheses are (asymptotically) stationary or when
the process is non-stationary under the null ($d_0 > 0.5$) and $d_0 - d_1 < 0.5$.
The test statistics will be non-standard otherwise. Dolado, Gonzalo and
Mayoral (2002) provide critical values for three sample sizes
($T = 100, 400, 1000$) for the range of values of $d$ for which the test has
a non-standard distribution under the null, for the case of a random
walk with or without a constant and linear trend. Finally, the authors
augment the FD-F test, by analogy to the ADF test, and find that the asymptotic
distribution of the t-statistic remains valid under serial correlation, as long as
a sufficient number of lags of $\Delta^{d_0} x_t$ are included in the testing regression.
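
The basic (unaugmented) FD-F regression can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the function names are hypothetical, the fractional difference is computed by truncating the binomial expansion at the start of the sample, no deterministic terms or lag augmentation are included, and $d_1$ is treated as known.

```python
import numpy as np

def frac_diff(x, d):
    """Apply (1 - B)^d to x, truncating the expansion at the start of the sample."""
    x = np.asarray(x, dtype=float)
    T = x.size
    c = np.ones(T)
    for j in range(1, T):
        c[j] = c[j - 1] * (j - 1 - d) / j
    return np.convolve(x, c)[:T]

def fdf_regression(x, d0, d1):
    """OLS estimate of phi and its t-ratio in the d0-difference on lagged d1-difference regression."""
    y = frac_diff(x, d0)[1:]    # Delta^{d0} x_t
    z = frac_diff(x, d1)[:-1]   # Delta^{d1} x_{t-1}
    phi = (z @ y) / (z @ z)
    resid = y - phi * z
    s2 = (resid @ resid) / (y.size - 1)
    return phi, phi / np.sqrt(s2 / (z @ z))

# Illustration: a simulated random walk tested against d1 = 0.7 (hypothetical data).
rng = np.random.default_rng(99)
random_walk = np.cumsum(rng.standard_normal(1000))
phi_hat, t_stat = fdf_regression(random_walk, d0=1.0, d1=0.7)
print("phi estimate:", phi_hat, " t-ratio:", t_stat)
```

Setting `d0=1` and `d1=0` reproduces the ordinary Dickey-Fuller regression of the first difference on the lagged level.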

A wide variety of other Wald-type and LM long-memory testing pro- 
cedures have been developed in the literature. Wald-type tests (Geweke and 
Porter-Hudak, 1983; Fox and Taqqu, 1986; Sowell, 1992a; Robinson, 1992) 
work under the alternative hypothesis and are based on point estimates and 
confidence intervals of d. The parametric and semi-parametric methods of 
estimating d, both in the frequency and the time domains, have been found 
to possess low power when used for testing purposes. The LM approaches, 
such as those proposed by Robinson (1994) and Tanaka (1999), in the 
frequency and time domains, respectively, evaluate the statistics under the 
null hypothesis. They have the advantage, unlike unit root tests, of having 
standard asymptotic distributions, but do not provide estimates of d
when the null hypothesis is rejected.

Tieslau, Schmidt and Baillie (1996) and Chong (2000) have suggested
estimating d and testing for fractional integration by minimising the dif- 
ference between the sample and population autocorrelations and partial 
autocorrelations, respectively. Delgado and Velasco (2005) propose sign- 
based tests for simple and composite hypotheses on d. Giraitis et al. (2003) 
develop a rescaled variance test which has a simpler asymptotic distribution 
and a better balance of size and power than Lo’s modified R/S test and the 
KPSS test of Kwiatkowski et al. (1992). 

It is important to note that, as shown by several authors (Diebold and 
Inoue, 2001; Engle and Smith, 1999; Granger and Hyung, 2004; Dittmann 
and Granger, 2002), long-memory tests are sensitive to a variety of factors, 
such as structural breaks, outliers, regime switching and non-linear trans- 
formations. 


4.3.4 Estimation of ARFIMA models


The fractional differencing parameter $d$ can be estimated by a variety of
methods. The earliest approach was to use the result in R/S analysis that
$\operatorname{plim}_{T \to \infty} T^{-d-0.5} R_0$ is a constant (Lo, 1991) and estimate $d$ as

$$\hat{d} = \frac{\ln(R_0)}{\ln(T)} - \frac{1}{2}$$


Another early and popular approach was the log-periodogram regression
proposed by Geweke and Porter-Hudak (1983) (GPH). The spectral density of
$x_t$ is given by

$$f_x(\omega) = |1 - \exp(-i\omega)|^{-2d} f_w(\omega) = (4\sin^2(\omega/2))^{-d} f_w(\omega)$$

where $f_w(\omega)$ is the spectral density of $w_t = (1 - B)^d x_t$. It then follows that

$$\ln(f_x(\omega)) = \ln(f_w(\omega)) - d \ln(4\sin^2(\omega/2))$$

and, given the sample $\{x_t\}_{t=1}^{T}$, this leads GPH to propose estimating $d$ as
(minus) the slope estimator of the least squares regression of the log of the
periodogram

$$I_T(\omega_j) = 2\hat{\sigma}_0^2 \left( 1 + 2 \sum_{s=1}^{T-1} r_s \cos(s\omega_j) \right)$$

on a constant and $\ln(4\sin^2(\omega_j/2))$, at frequencies $\omega_j = 2\pi j/T$, $j = 1, \ldots, K$,
where typically $K = [T^{4/5}]$ (other choices for $K$ are discussed in Baillie, 1996).
A critical assumption of the GPH estimator is that the spectrum of the
ARFIMA(p,d,q) process is the same as that of an ARFIMA(0,d,0) process.
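
A minimal sketch of the log-periodogram regression follows. The function names are hypothetical, the periodogram is computed from the discrete Fourier transform with a normalisation that differs from the expression above only by a constant (which is absorbed by the regression intercept), and the test series is fractional white noise simulated from a truncated MA representation, which is an assumption made purely for illustration.

```python
import numpy as np

def simulate_fractional_noise(d, T, rng, burn=2000):
    """ARFIMA(0,d,0) draws from a truncated MA(infinity) representation."""
    n = T + burn
    psi = np.ones(n)
    for j in range(1, n):
        psi[j] = psi[j - 1] * (j - 1 + d) / j
    shocks = rng.standard_normal(n)
    return np.convolve(shocks, psi)[:n][burn:]

def gph_estimate(x, power=0.8):
    """Log-periodogram estimate of d with K = floor(T ** power) frequencies."""
    x = np.asarray(x, dtype=float)
    T = x.size
    K = int(np.floor(T ** power))
    freqs = 2.0 * np.pi * np.arange(1, K + 1) / T
    dft = np.fft.fft(x - x.mean())[1:K + 1]
    periodogram = np.abs(dft) ** 2 / (2.0 * np.pi * T)
    regressor = np.log(4.0 * np.sin(freqs / 2.0) ** 2)
    X = np.column_stack([np.ones(K), regressor])
    beta, *_ = np.linalg.lstsq(X, np.log(periodogram), rcond=None)
    d_hat = -beta[1]                           # minus the slope
    asymptotic_se = np.pi / np.sqrt(24.0 * K)  # from the N(0, pi^2/24) limit
    return d_hat, asymptotic_se

rng = np.random.default_rng(0)
x = simulate_fractional_noise(0.3, 4096, rng)
d_hat, se = gph_estimate(x)
print("GPH estimate of d:", d_hat, " asymptotic s.e.:", se)
```

The `power` argument controls how many low frequencies enter the regression; smaller values reduce the influence of short-run dynamics at the cost of a larger standard error.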

Robinson (1995a), Hurvich, Deo and Brodsky (1998), Tanaka (1999),
Velasco (1999) and Lieberman (2001) have analysed the GPH estimator $\hat{d}$ in
great detail. Under the assumption of normality for $x_t$, it has now been proved
that the estimator is consistent for $-0.5 < d < 1$ and asymptotically normal, so
that the estimated standard error attached to $\hat{d}$ can be used for inference.
Alternatively, the asymptotic result that $\sqrt{K}(\hat{d} - d) \sim N(0, \pi^2/24)$ may be
used.

An alternative semi-parametric estimator has been proposed by Robinson
(1995b). This is defined as $\tilde{d} = \arg\min_d R(d)$, where

$$R(d) = \log\left( K^{-1} \sum_{j=1}^{K} \omega_j^{2d} I_T(\omega_j) \right) - (2d/K) \sum_{j=1}^{K} \log \omega_j$$

Inference on $d$ uses the result that $\sqrt{K}(\tilde{d} - d) \sim N(0, 1/4)$. Once an estimate
of $d$ has been obtained, the series $\Delta^d x_t$ may be constructed by using (4.20) and
truncating appropriately, and the ARMA part of the model can then be 
identified and estimated using the techniques outlined in chapter 2. Hurvich 
and Ray (1995) have extended the GPH estimator to the case of a non- 
stationary, non-invertible process. More recently, Andrews and Sun (2004) 
have proposed an alternative semi-parametric spectral approach, based on a 
generalisation of the local polynomial Whittle estimator of Künsch (1987), 
which achieves the optimal rate of convergence. Sun and Phillips (2003) 
propose a non-linear log-periodogram regression approach to accommodate 
fractional processes that are perturbed by weakly dependent series and allow 
different sources of variation in the short- and long-run dynamics. 
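
The objective function $R(d)$ of the Robinson (1995b) estimator is simple enough to minimise by a grid search. The sketch below is an illustration only: the function name, the grid bounds and the choice of `K` are hypothetical, and a numerical optimiser could replace the grid.

```python
import numpy as np

def local_whittle(x, K, grid=None):
    """Grid-search minimiser of R(d) over the first K Fourier frequencies."""
    x = np.asarray(x, dtype=float)
    T = x.size
    freqs = 2.0 * np.pi * np.arange(1, K + 1) / T
    periodogram = np.abs(np.fft.fft(x - x.mean())[1:K + 1]) ** 2 / (2.0 * np.pi * T)
    if grid is None:
        grid = np.linspace(-0.49, 0.99, 297)  # step of 0.005
    mean_log_freq = np.mean(np.log(freqs))
    objective = [
        np.log(np.mean(freqs ** (2.0 * d) * periodogram)) - 2.0 * d * mean_log_freq
        for d in grid
    ]
    d_tilde = float(grid[int(np.argmin(objective))])
    asymptotic_se = 0.5 / np.sqrt(K)  # from the N(0, 1/4) limit
    return d_tilde, asymptotic_se

# Illustration on white noise, for which d = 0 (hypothetical data).
rng = np.random.default_rng(11)
d_tilde_wn, se_wn = local_whittle(rng.standard_normal(4096), K=222)
print("local Whittle estimate for white noise:", d_tilde_wn, "+/-", se_wn)
```

Because only the lowest `K` frequencies are used, the estimate is insensitive to the short-run ARMA dynamics, which is what makes the estimator semi-parametric.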

The standard approach following the estimation of d is to fit an ARMA 
model to the fractionally differenced data. The filtered data are often 
assumed to be normally distributed in order to employ standard estimation 
and inference when analysing the final ARFIMA model. Not surprisingly, 
semi-parametric estimators of d can often perform poorly in terms of bias 
and MSE when the orders p and q are known. 

An alternative estimator of $d$ is the approximate ML estimator of Beran
(1995). The approach assumes an ARFIMA(p,d,q) process with $d = m + \delta$,
$-0.5 < \delta < 0.5$, where $m$ is a known integer denoting the number of times
the series must be differenced in order to attain stationarity. The estimator
is formed by fitting an ARFIMA model for various values of $\delta$ and selecting
the parameter $d = m + \delta$ that minimises the sum of squared residuals.
Obviously, the choice of $m$ is crucial, since the method yields asymptotically
normal estimates of the ARFIMA parameters if $-0.5 < \delta < 0.5$.
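
The idea of selecting $d$ by minimising the sum of squared residuals can be sketched for the simplest case. The code below is an illustration in the spirit of the approach just described, not Beran's procedure itself: it assumes $m = 0$ and $p = q = 0$, so that the residuals are simply the fractionally differenced series, and the function names and grid are hypothetical.

```python
import numpy as np

def frac_diff(x, d):
    """Apply (1 - B)^d to x, truncating the expansion at the start of the sample."""
    x = np.asarray(x, dtype=float)
    T = x.size
    c = np.ones(T)
    for j in range(1, T):
        c[j] = c[j - 1] * (j - 1 - d) / j
    return np.convolve(x, c)[:T]

def min_ssr_estimate(x, grid):
    """Grid value of d minimising the sum of squared residuals of (1 - B)^d x_t."""
    x = np.asarray(x, dtype=float)
    ssr = [np.sum(frac_diff(x, d) ** 2) for d in grid]
    return float(grid[int(np.argmin(ssr))])

# Illustration on white noise, for which d = 0 (hypothetical data).
rng = np.random.default_rng(21)
grid = np.linspace(-0.45, 0.45, 91)  # step of 0.01
d_wn = min_ssr_estimate(rng.standard_normal(1500), grid)
print("minimum-SSR estimate of d for white noise:", d_wn)
```

With ARMA terms present, each grid value would instead require fitting the ARMA part to the fractionally differenced series and recording the resulting residual sum of squares.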

Significant attention has also been given to joint ML estimation of all the 
parameters in the ARFIMA(p,d,q) model, as developed by Sowell (1992a, 
1992b) and surveyed by Baillie (1996). These methods do have the drawback, 
however, that specialised software is required, and they entail particularly 
cumbersome computations for higher-order ARMA specifications and for 
even moderate sample sizes. Moreover, it can be difficult to identify the correct 
values of p and q at the same time as determining the value of d. Nonetheless, 
ML routines for ARFIMA models are provided in PcGive 11 (Hendry and 
Doornik, 2006), for example, along with NLS and other estimation techniques. 

Evidence of long memory has been found in a variety of financial time series, 
including stock returns, exchange rates and interest rates (see Baillie, 1996). 
Little has been said about the relative usefulness of the ARFIMA model for 
forecasting purposes, however. Bhardwaj and Swanson (2006) have recently 
analysed the predictive ability of ARFIMA models using three previously 
analysed financial and macroeconomic data sets. They employ a variety of 
long-memory testing and estimation procedures and evaluate the ex ante 
forecasting ability of ARFIMA models against standard short-memory models. 

Their results show that ARFIMA models are able to approximate the true 
data-generating process and sometimes to perform significantly better in 
out-of-sample forecasting than simple short-memory models. Samples 
exceeding 5000 observations are sufficient to provide very stable rolling and 
recursive estimates of d, although samples of fewer than 2500 observations 
substantially increase estimation error. Moreover, it appears that ARFIMA 
models may be particularly useful at longer forecasting horizons. Finally, 
they observe, against conventional wisdom, that ARFIMA models were often 
superior in terms of forecasting ability even though they are less parsimo- 
nious than ARMA models, in terms of the additional parameter d and the 
ad hoc application of the truncation filter. 


Example 4.5 Long-memory and fractional differencing in exchange 
rates and stock returns 
In example 3.1 we confirmed that the dollar/sterling exchange rate contains a 
unit root, while in example 3.2 we confirmed that this was also the case for 
the FTA All Share index. We now consider whether the differences of the two 
series, the returns, are really stationary or whether they exhibit long memory. 
We first compute the modified R/S statistic, $T^{-1/2} R_q$, for the exchange
rate differences. Lo (1991) recommends choosing $q$ as $[T^{1/4}] = 8$, as with
the non-parametric unit root statistic. Using this setting we obtain
$T^{-1/2} R_q = 1.092$, and, since a 95 per cent confidence interval for this statistic
is (0.809, 1.862) (see Lo, 1991, table II), we cannot reject the hypothesis that
exchange rate returns are short memory. This finding is confirmed by both
LM tests and the semi-parametric estimates of $d$. Since the levels are
effectively a driftless random walk (see example 2.5), LM tests were
constructed using $\Delta x_t$; (4.22) therefore reduces to a regression of $\Delta x_t$ on
$K_t(m) = \sum_{j=1}^{m} j^{-1} \Delta x_{t-j}$. The calculated t-ratios for $\delta$ are around $-1$ for $m$ set
equal to 25, 50, 75 and 100. The GPH estimate computed using
$K = [T^{4/5}] = 667$ is $\hat{d} = 0.005$ with a standard error of 0.03, while $\tilde{d} = 0.01$.

For the FTA All Share returns, we obtain $T^{-1/2} R_q = 1.504$, which is again
insignificant. The GPH estimate is $\hat{d} = 0.05$ with a standard error of 0.06,
computed using $K = 142$, while $\tilde{d}$ is only 0.01, with a standard error of 0.04.
The LM t-ratios were never remotely significant for a wide range of m values. 
This is, in fact, consistent with the simulation results of Agiakloglou and 
Newbold (1994), who find that the power of this test to reject d= 0 is very 
weak when the sample mean (the drift in the index here) has to be estimated. 

As a further example, we investigate the daily returns for the S&P 500 
index from January 1928 to August 1991, a total of T= 17,054 observations, 
a series that was originally analysed in Ding, Granger and Engle (1993). The 
GPH estimate for the returns is $\hat{d} = 0.11$, with a standard error of 0.06, so
there is little evidence that the series is long memory. For the squared returns
series we obtain $\hat{d} = 0.56$, however, while for the absolute returns we obtain
$\hat{d} = 0.73$. Thus, simple non-linear transformations of returns do appear to
be long memory, and this is also found to be the case for a wide variety of 
other financial series: see the results in Ding and Granger (1996), Granger 
and Ding (1996), and Mills (1996a, 1997a). These types of models are ana- 
lysed in greater detail in the next chapter. 


Example 4.6 An ARFIMA process for US Treasury bills 

As has been seen in figure 4.2, weekly observations on the US Treasury bill 
rate demonstrate locally trending behaviour, reaching a peak during 
December 1980 of 16.76 per cent. Both ADF and Phillips—Perron tests 
(including a constant but no trend) cannot reject the null hypothesis of a 
unit root in the data, the test statistics being —1.91 and —2.13, respectively. 
The notion of non-stationary interest rates contradicts financial theory and 
the restriction that they are bounded by zero, however.

Figure 4.4 SACF of three-month US Treasury bills

Figure 4.5 Fractionally differenced (d = 0.88) three-month US Treasury bills (monthly April 1954-February 2005)


The autocorrelation function of the series, shown in figure 4.4, suggests 
some type of long-memory process. The KPSS test strongly rejects the null 
hypothesis of I(0) with a statistic of 0.651, much larger than the 5 per cent 
critical value of 0.463. Further evidence of a long-memory process is provided
by Lo’s modified R/S test statistic of 4.709, which is a highly significant value.
Estimation of the fractional differencing parameter using the semi-para- 
metric approaches of GPH and Robinson (1994) produces estimated d values 
of 0.905 and 0.487, respectively. These suggest that simple differencing, as 
suggested by the standard unit root tests, may remove too much information 
from the data. 

On the basis of these tests, joint Gaussian ML estimation of the param- 
eters of an ARFIMA model was attempted. The best fit was finally offered by 
the following ARFIMA(0,d,1) model with $d$ estimated to be 0.88 (t-ratios in
parentheses):

$$\Delta^d x_t = \underset{(30.89)}{2.279} + \hat{a}_t - \underset{(4.90)}{0.593}\, \hat{a}_{t-1}$$


The fractional differencing parameter is found to be close to the GPH esti- 
mate and is highly significant with a t-ratio of 9.47. It is interesting to observe 
the now clearly stationary fractionally differenced Treasury bill series in 
figure 4.5. 
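
The fractional differencing filter $\Delta^d = (1 - B)^d$ used to produce a series such as that in figure 4.5 can be sketched through its binomial expansion. The following is a minimal illustration only, assuming a filter truncated at the start of the sample; the function names `frac_diff_weights` and `frac_diff` are hypothetical, and this is not the ML estimation procedure used for the ARFIMA model.

```python
import numpy as np

def frac_diff_weights(d, n):
    """First n coefficients of (1 - B)^d: w_0 = 1, w_k = w_{k-1} (k - 1 - d) / k."""
    w = np.empty(n)
    w[0] = 1.0
    for k in range(1, n):
        w[k] = w[k - 1] * (k - 1 - d) / k
    return w

def frac_diff(x, d):
    """Truncated fractional difference: y_t = sum_{k=0}^{t} w_k x_{t-k}."""
    x = np.asarray(x, dtype=float)
    w = frac_diff_weights(d, len(x))
    # x[t::-1] runs x_t, x_{t-1}, ..., x_0, matching w_0, ..., w_t
    return np.array([np.dot(w[: t + 1], x[t::-1]) for t in range(len(x))])
```

With d = 1 the weights collapse to (1, -1, 0, 0, ...), i.e. the ordinary first difference, whereas for d = 0.88 the weights decay slowly, so that some of the long-run information removed by simple differencing is retained.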


Univariate non-linear stochastic 
models: martingales, random 
walks and modelling volatility 


As we have seen in previous chapters, financial time series often appear to be 
well approximated by random walks. The relationship between random 
walks and the theory of efficient capital markets was briefly discussed in 
chapter 1, where it was argued that the random walk assumption that asset 
price changes are independent is usually too restrictive to be consistent with 
a reasonably broad class of optimising models; what is in fact required is that 
a variable related to the asset price be a martingale (see, for example, 
Andreou, Pittis and Spanos, 2001). 

Martingales and random walks are discussed formally in section 5.1, with 
tests of the random walk hypothesis being the subject of section 5.2. The 
relaxation of the assumption that changes in a time series must be inde- 
pendent and identically distributed allows the possibility of examining non- 
linear stochastic processes, and the remainder of the chapter therefore 
introduces various non-linear models that are now used regularly in ana- 
lysing the volatility of financial time series — a fundamental concern of 
financial modellers. Ways of measuring volatility itself are introduced in 
section 5.3, before formal stochastic models are considered. Stochastic 
volatility (SV) models are discussed in section 5.4, ARCH processes in 
section 5.5, further models related to ARCH in section 5.6, and section 5.7 
briefly looks at the forecasting performance of alternative volatility models. 


5.1 Martingales, random walks and non-linearity

A martingale is a stochastic process that is a mathematical model of ‘fair
play’. The term ‘martingale’, which also denotes part of a horse’s harness or a
ship’s rigging, refers in addition to a gambling system in which every losing
bet is doubled, a usage that may be felt to be rather apposite when
considering the behaviour of financial data!

A martingale may be formally defined as a stochastic process $\{x_t\}$ having
the following properties:
(a) $E(|x_t|) < \infty$ for each $t$;
(b) $E(x_t \mid \mathcal{S}_s) = x_s$ whenever $s \le t$, where $\mathcal{S}_t$ is the $\sigma$-algebra comprising
events determined by observations over the interval $[0, t]$, so that $\mathcal{S}_s \subseteq \mathcal{S}_t$
when $s \le t$. This is known as the ‘martingale property’.

While the ‘history’ $\{\mathcal{S}_t\}$ can, in general, include observations on any
number of variables, it is often restricted to be just the past history of $\{x_t\}$
itself, i.e. $\mathcal{S}_t = \sigma(x_s; s \le t)$. Written as


\[ E(x_t - x_s \mid \mathcal{S}_s) = 0, \quad s \le t \tag{5.1} \]


the martingale property implies that the MMSE forecast of a future incre- 
ment of a martingale is zero. This property can be generalised to situations, 
quite common in finance, where 


\[ E(x_t - x_s \mid \mathcal{S}_s) \ge 0, \quad s \le t \]


in which case we have a submartingale, and to the case where the above 
inequality is reversed, giving us a supermartingale. 
The martingale given by (5.1) can be written equivalently as 


\[ x_t = x_{t-1} + a_t \]


where $a_t$ is the martingale increment or martingale difference. When written
in this form, the sequence $\{x_t\}$ looks superficially identical to the random
walk, a model that was first introduced formally in chapter 2. There $a_t$ was
defined to be a stationary and uncorrelated sequence drawn from a fixed
distribution, i.e. to be white noise. As was discussed in chapter 2, section 4,
however, alternative definitions are possible: $a_t$ could be defined to be strict
white noise, so that it is both a stationary and an independent sequence,
rather than just being uncorrelated. Moreover, it is possible for $a_t$ to be
uncorrelated but not necessarily stationary. While the white-noise
assumptions rule this out, such behaviour is allowed for martingale differences; this
implies that there could be dependence between higher-conditional
moments, most notably conditional variances.

The possibility of this form of dependence in financial time series, which
often go through protracted quiet periods interspersed with bursts of
turbulence, leads naturally to the consideration of non-linear stochastic
processes capable of modelling such volatility. Non-linearity can be introduced
in many other ways, however, some of which may violate the martingale
model. As an illustration, suppose that $x_t$ is generated by the process $\Delta x_t = \eta_t$,
with $\eta_t$ being defined as


\[ \eta_t = a_t + \beta a_{t-1} a_{t-2} \]


where $a_t$ is strict white noise. It follows immediately that $\eta_t$ has zero mean,
constant variance and ACF given by


\[
E(\eta_t \eta_{t-k}) = E\left(a_t a_{t-k} + \beta a_{t-1} a_{t-2} a_{t-k} + \beta a_t a_{t-k-1} a_{t-k-2} + \beta^2 a_{t-1} a_{t-2} a_{t-k-1} a_{t-k-2}\right)
\]


For all $k \ne 0$, each of the terms in the ACF has zero expectation, so that, as far
as its second-order properties are concerned, $\eta_t$ behaves just like an
independent process. The MMSE forecast of a future observation, $\eta_{t+1}$, is not zero
(the unconditional expectation), however, but is the conditional expectation


\[ \hat{\eta}_{t+1} = E(\eta_{t+1} \mid \eta_t, \eta_{t-1}, \ldots) = \beta a_t a_{t-1} \]

It then follows that $x_t$ is not a martingale, because

\[ E(x_{t+1} - x_t \mid \eta_t, \eta_{t-1}, \ldots) = \hat{\eta}_{t+1} \ne 0 \]


and the non-linear structure of the $\eta_t$ process could be used to improve the
forecasts of $x_t$ over the simple ‘no change’ forecast associated with the
martingale model.
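
The point of this illustration can be checked by simulation. The sketch below assumes Gaussian strict white noise and an arbitrary value of the coefficient $\beta$; the function names are hypothetical. It also treats the innovations as observed, which would not be the case in practice, so it should be read as an illustration of the argument rather than as a forecasting method.

```python
import numpy as np

def simulate_eta(n, beta, seed=0):
    """Simulate eta_t = a_t + beta * a_{t-1} * a_{t-2} with Gaussian strict white noise a_t."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal(n + 2)
    cond_mean = beta * a[1:-1] * a[:-2]   # E(eta_t | past) = beta * a_{t-1} * a_{t-2}
    eta = a[2:] + cond_mean
    return eta, cond_mean

def sample_acf(x, max_lag):
    """Sample autocorrelations r_1, ..., r_max_lag."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    denom = np.dot(x, x)
    return np.array([np.dot(x[k:], x[:-k]) / denom for k in range(1, max_lag + 1)])

eta, cond_mean = simulate_eta(50_000, beta=0.5)
acf = sample_acf(eta, 5)                      # all close to zero
mse_zero = np.mean(eta ** 2)                  # 'no change' forecast of x, i.e. forecast eta by 0
mse_cond = np.mean((eta - cond_mean) ** 2)    # forecast eta by its conditional expectation
```

The sample autocorrelations are negligible, yet the conditional forecast has a smaller mean squared error than the zero forecast, which is exactly why the implied $x_t$ fails to be a martingale.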


5.2 Testing the random walk hypothesis 


Notwithstanding the above discussion, the random walk model has played a 
major role in the empirical analysis of financial time series: see, for example, 
the seminal research of Fama (1965) and Granger and Morgenstern (1970). 
In chapter 3 we examined various tests of an observed time series being a 
random walk. In the main these were developed by assuming that there was a 
specific alternative to the random walk null, such as the stationary AR(1) 
process used for expository purposes in chapter 3, section 1.2, but also 
advanced by Shiller (1981a) as a model of stock market fads, and the Poterba 
and Summers (1988) UC model in which this AR(1) process is added to a 
pure random walk. There have also been numerous other tests developed 
against a variety of different alternatives, some of which we now discuss.


5.2.1 Autocorrelation tests


Using the results stated in chapter 2, section 5.1, if $w_t = \Delta x_t$ is strict white
noise then the asymptotic distribution of the sample autocorrelations
(standardised by $\sqrt{T}$) calculated from the realisation $\{w_t\}_1^T$ will be $N(0,1)$,
so that the random walk null would be rejected at the 5 per cent significance
level if, for example, $\sqrt{T}\,|r_1| > 1.96$.

If a set of sample autocorrelations are considered, say $r_1, \ldots, r_K$, then
some will probably be significant even if the null is true: on average one out
of twenty will be significant at the 5 per cent level. As noted in example 2.1,
the portmanteau statistics $Q^*(K)$ and $Q(K)$ may be used in these
circumstances. On the random walk null, both statistics are distributed as $\chi_K^2$, so
that the null would be rejected for sufficiently high values. Note that these
tests do not require a specific alternative hypothesis; they may thus be
regarded as ‘diagnostic’ tests with, hopefully, some power against the null for
a wide range of alternatives.

The tests do, however, require that the innovations to the random walk be 
strict white noise. If the innovations are merely uncorrelated, rather than 
independent, then the above testing procedure will be unreliable. To show 
this, relax the strict white-noise assumption on $w_t$ to that of just satisfying the
weak dependence conditions (3.11). In this case, $\sqrt{T}\, r_1 \overset{a}{\sim} N(0, \tau^2)$, where


\[
\tau^2 = \sigma_w^{-4}\left( V(w_1 w_2) + 2\sum_{i=1}^{\infty} \mathrm{Cov}(w_1 w_2, w_{i+1} w_{i+2}) \right)
\]


(Romano and Thombs, 1996, theorem 2.1). An example of such a process is
$w_t = z_t z_{t-1}$, where $z_t$ is itself zero-mean strict white noise with $E(z_t^2) = \sigma_z^2$
and $E(z_t^4) < \infty$. It is therefore clear that, for all $i > 0$,


\[ \mathrm{Cov}(w_1 w_2, w_{i+1} w_{i+2}) = 0 \]


\[ V(w_1 w_2) = E(w_1^2 w_2^2) = (E(z_t^2))^2 \cdot E(z_t^4) = \sigma_z^4 \cdot E(z_t^4) \]


and

\[ \sigma_w^2 = E(w_t^2) = E(z_t^2 z_{t-1}^2) = \sigma_z^4 \]

so that $\tau^2 = E(z_t^4)/\sigma_z^4$. For example, if the $z_t$ are standard normal, $\tau^2 = 3$, and, in general, $\tau^2$ can be
made arbitrarily large. Hence, a test of zero correlation based on, say,
$\sqrt{T}\,|r_1| > 1.96$ will lead to a high probability of incorrectly rejecting the
hypothesis of zero correlation.
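
A small Monte Carlo experiment makes the size distortion concrete. The sketch below is an illustration under stated assumptions: the $z_t$ are standard normal, the sample size and number of replications are arbitrary, and the function name `lag1_stat_distribution` is hypothetical.

```python
import numpy as np

def lag1_stat_distribution(T=1000, reps=2000, seed=0):
    """Simulate sqrt(T) * r_1 for w_t = z_t * z_{t-1}, z_t standard normal."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((reps, T + 1))
    w = z[:, 1:] * z[:, :-1]                      # uncorrelated but dependent
    w = w - w.mean(axis=1, keepdims=True)
    r1 = (w[:, 1:] * w[:, :-1]).sum(axis=1) / (w ** 2).sum(axis=1)
    return np.sqrt(T) * r1

stats = lag1_stat_distribution()
variance = stats.var()                            # should be near tau^2 = 3, not 1
rejection_rate = np.mean(np.abs(stats) > 1.96)    # well above the nominal 0.05
```

Under these assumptions the simulated variance of $\sqrt{T} r_1$ should be close to 3 rather than 1, so the nominal 5 per cent test rejects a true null of zero correlation far too often.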

It is straightforward to construct examples for which $\tau^2 < 1$. Suppose that
$E(1/z_t^2) < \infty$. Then $w_t = z_t/z_{t-1}$ will again be uncorrelated, but now


\[ V(w_1 w_2) = V(z_2/z_0) = E(z_t^2)\,E(1/z_t^2) = V(w_1) \]


Thus,


\[ \tau^2 = \frac{V(w_1 w_2)}{(V(w_1))^2} = (V(w_1))^{-1} = \left(E(z_t^2)\,E(1/z_t^2)\right)^{-1} < 1 \]


Romano and Thombs (1996, example 3.5) show that, if $w_t$ is no longer strict
white noise, then $Q^*(K)$ is no longer asymptotically distributed as $\chi_K^2$. For
example, if $w_t = z_t z_{t-1}$, then $Q^*(K)$ is distributed as a weighted sum of
independent $\chi_1^2$ variates, leading to a rejection probability greater than the
nominal significance level using the $\chi_K^2$ distribution.

Three approaches have been developed in response to this problem. The 
first is to modify the test statistic. Lobato, Nankervis and Savin (2001, 2002) 
propose modifying the portmanteau statistic to 


\[ \tilde{Q}(K) = T \sum_{i=1}^{K} \left( r_i^2 / v_i \right) \overset{a}{\sim} \chi_K^2 \]


where

\[ v_i = T^{-1} \sum_{t=i+1}^{T} (w_t - \bar{w})^2 (w_{t-i} - \bar{w})^2 \big/ \hat{\sigma}_w^4 \]


They also propose further extensions based on considering the covariance
matrix of the set of sample autocorrelations $r_1, r_2, \ldots, r_K$: see the above
references for details. The second approach is to continue to use $Q^*(K)$ but
to estimate its distribution using bootstrap simulation techniques: see
Horowitz et al. (2006). Kim, Nelson and Startz (1998) propose a similar
approach for the variance ratio statistic, in which $w_t$ is first standardised by a
simulated set of conditional variances.
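
The first of these approaches, rescaling each squared autocorrelation by an estimate of its own variance, can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the unadjusted statistic is taken to be $T \sum r_i^2$, the data-generating process $w_t = z_t z_{t-1}$ with standard normal $z_t$ is an assumption, and the function names are hypothetical.

```python
import numpy as np

CHI2_5_CRIT_95 = 11.07  # 5 per cent critical value of chi-square with 5 degrees of freedom

def portmanteau_stats(w, K):
    """Return (unadjusted, variance-rescaled) portmanteau statistics for lags 1..K."""
    w = np.asarray(w, dtype=float)
    T = len(w)
    dev = w - w.mean()
    c0 = np.mean(dev ** 2)                                   # sample variance of w
    q_std, q_rob = 0.0, 0.0
    for i in range(1, K + 1):
        r_i = np.sum(dev[i:] * dev[:-i]) / (T * c0)          # sample autocorrelation
        v_i = np.sum(dev[i:] ** 2 * dev[:-i] ** 2) / (T * c0 ** 2)
        q_std += T * r_i ** 2
        q_rob += T * r_i ** 2 / v_i
    return q_std, q_rob

def rejection_rates(T=500, reps=500, K=5, seed=0):
    """Monte Carlo rejection frequencies under w_t = z_t * z_{t-1} (K = 5 only)."""
    rng = np.random.default_rng(seed)
    rejections = np.zeros(2)
    for _ in range(reps):
        z = rng.standard_normal(T + 1)
        q_std, q_rob = portmanteau_stats(z[1:] * z[:-1], K)
        rejections += np.array([q_std > CHI2_5_CRIT_95, q_rob > CHI2_5_CRIT_95], dtype=float)
    return rejections / reps
```

Calling `rejection_rates()` should show the unadjusted statistic over-rejecting, while the rescaled statistic stays much closer to the nominal 5 per cent level.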

The third approach is to construct non-parametric versions of the tests.
Wright (2000), for example, develops variance ratio tests based on the ranks
and signs of $w_t$. Thus, if $r(w_t)$ is the rank of $w_t$ in the sample $\{w_t\}_1^T$, then the
standardisation

\[ r_{1,t} = \left( r(w_t) - \frac{T+1}{2} \right) \bigg/ \sqrt{\frac{(T-1)(T+1)}{12}} \]

ensures a series with zero mean and unit variance. The rank-based variance
ratio test is then defined as


\[ R(k) = \left( \frac{(Tk)^{-1} \sum_{t=k}^{T} \left(r_{1,t} + \cdots + r_{1,t-k+1}\right)^2}{T^{-1} \sum_{t=1}^{T} r_{1,t}^2} - 1 \right) \left( \frac{2(2k-1)(k-1)}{3kT} \right)^{-1/2} \tag{5.2} \]


Wright provides critical values for the distribution of this statistic under the
random walk null for a variety of choices of $k$ and $T$ and also proposes a
second statistic based on an alternative transformation of $r(w_t)$.

For a test based on signs, let


\[ s_t = \begin{cases} 1 & \text{if } w_t > 0 \\ -1 & \text{if } w_t \le 0 \end{cases} \]


A test statistic $S$ can then be defined analogously to $R$ in (5.2). This statistic
assumes that $w_t$ has zero mean. If not, Wright provides a related test statistic,
and the statistics have been further generalised by Luger (2003).
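
The rank and sign statistics can be computed with a few lines of code. The sketch below is an illustration under assumptions: ranks are standardised to have zero mean and unit variance, ties are ignored (a continuous distribution is assumed), overlapping $k$-period sums are used, and the function names are hypothetical. Critical values would still have to be taken from Wright's tables or simulated.

```python
import numpy as np

def _vr_statistic(y, k):
    """Variance ratio statistic of the form (5.2) applied to a transformed series y."""
    y = np.asarray(y, dtype=float)
    T = len(y)
    csum = np.concatenate(([0.0], np.cumsum(y)))
    k_sums = csum[k:] - csum[:-k]                 # y_t + ... + y_{t-k+1}, t = k, ..., T
    ratio = (np.sum(k_sums ** 2) / (T * k)) / (np.sum(y ** 2) / T)
    return (ratio - 1.0) / np.sqrt(2.0 * (2 * k - 1) * (k - 1) / (3.0 * k * T))

def standardised_ranks(w):
    """Ranks of w, centred on (T + 1) / 2 and scaled to unit variance (no ties assumed)."""
    w = np.asarray(w, dtype=float)
    T = len(w)
    ranks = np.argsort(np.argsort(w)) + 1.0
    return (ranks - (T + 1) / 2.0) / np.sqrt((T - 1) * (T + 1) / 12.0)

def rank_vr_test(w, k):
    return _vr_statistic(standardised_ranks(w), k)

def sign_vr_test(w, k):
    signs = np.where(np.asarray(w, dtype=float) > 0, 1.0, -1.0)
    return _vr_statistic(signs, k)
```

Because only ranks or signs enter the calculation, the statistics are unaffected by monotonic transformations of the data or by heavy tails, which is the attraction of the non-parametric approach.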


5.2.2 Calendar effects 


As remarked above, autocorrelation tests are generally diagnostic checks 
aimed at detecting general departures from white noise and do not consider 
autocorrelations associated with specific timing patterns — i.e. patterns 
associated with ‘calendar effects’. There has been a great deal of research 
carried out in recent years on detecting such effects. To date, researchers have 
found evidence of a January effect, in which stock returns in this month are 
exceptionally large when compared to the returns observed for other 
months; a weekend effect, in which Monday mean returns are negative rather 
than positive, as for all other weekdays; a holiday effect, showing a much 
larger mean return for the day before holidays; a turn-of-the-month effect, in 
which the four-day return around the turn of a month is greater than the 
average total monthly return; an intramonth effect, in which the return over 
the first half of a month is significantly larger than the return over the second 
half; and a variety of intraday effects. 

Early reviews of these ‘anomalies’ are Thaler (1987a, 1987b) and Mills 
and Coutts (1995), while Schwert (2003) provides a more recent survey and
additional evidence. A wide range of statistical techniques have been employed
to detect such anomalies, and discussion of them here would take us too far
afield from our development of formal time series models. Taylor (1986,
pp. 41-4) discusses some of the techniques, and the interested reader is
recommended to examine both this and further papers cited in the above 
references. 


5.3 Measures of volatility 


Since the seminal work of Markowitz on portfolio theory, volatility has 
become an extremely important variable in finance, appearing regularly in 
models of asset pricing, portfolio theory, risk management, etc. Much of the 
interest in volatility has to do with its not being directly observable, and 
several alternative measures have been developed to approximate it empir- 
ically, the most common being the unconditional standard deviation of 
historical returns. Despite being convenient and simple, this measure is 
severely limited by the fact that returns are typically non-iid (independent 
and identically distributed), with distributions that are leptokurtic and 
skewed (see chapter 7). Moreover, the standard deviation may not be an 
appropriate representation of financial risk, so that some other measure 
should be used, such as the semi-variance (see the review by Nawrocki, 1999) 
or the absolute deviation (see Granger and Ding, 1995). 

In practice, the standard deviation is typically calculated from a sample of
daily close-to-close logarithmic returns using the ML estimator of the variance


\[ \hat{\sigma}_{ML}^2 = n^{-1} \sum_{t=1}^{n} r_t^2 \]


where $r_t = p_t - p_{t-1}$ is the logarithmic return, $p_t$ being the logarithm of price,
and $n + 1$ is the sample size. Standard results show that $\hat{\sigma}_{ML}^2$ has the following
asymptotic distribution:


\[ \sqrt{n}\,(\hat{\sigma}_{ML}^2 - \sigma^2) \overset{a}{\sim} N(0, 2\sigma^4) \]


where $\sigma^2$ is the true return variance. A mean-adjusted estimator is easily
obtained as


\[ \hat{\sigma}_{MA}^2 = \frac{1}{n-1} \left( \sum_{t=1}^{n} r_t^2 - \frac{\left(\log(P_n/P_0)\right)^2}{n} \right) \]

where $P_0$ and $P_n$ are the first and last price observations in the sample.

The term ‘volatility’ in finance typically corresponds to the annualised
historical standard deviation of returns. Assuming that the underlying
logarithmic price process is a random walk, this can be calculated by
multiplying the standard deviation by the square root of time. This implies
that uncertainty, or risk, as measured by volatility, increases with the square
root of the time horizon. For example, if the standard deviation for a stock is
estimated from historical daily returns to be 0.02, and assuming that a
calendar year corresponds to a business year of 250 trading days, volatility
will be $0.02\sqrt{250} \approx 0.3162$, or 31.62 per cent. A word of caution is put forward by
Diebold et al. (1998), who demonstrate that volatility estimates may depend
significantly on the sampling frequency of the returns used to estimate the
standard deviation, caused by returns not necessarily being iid.
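
The historical estimators and the square-root-of-time rule can be sketched as follows. This is a minimal illustration, assuming the mean-adjusted estimator uses an $n - 1$ divisor and a 250-day business year; the function names are hypothetical.

```python
import numpy as np

def ml_variance(log_prices):
    """Zero-mean ML estimator: average squared log return."""
    r = np.diff(np.asarray(log_prices, dtype=float))
    return np.mean(r ** 2)

def mean_adjusted_variance(log_prices):
    """Mean-adjusted estimator written in terms of the first and last log prices."""
    p = np.asarray(log_prices, dtype=float)
    r = np.diff(p)
    n = len(r)
    # p[-1] - p[0] equals log(P_n / P_0), the sum of the n log returns
    return (np.sum(r ** 2) - (p[-1] - p[0]) ** 2 / n) / (n - 1)

def annualise(daily_sd, periods_per_year=250):
    """Square-root-of-time scaling, valid if log prices follow a random walk."""
    return daily_sd * np.sqrt(periods_per_year)
```

For instance, a daily standard deviation of 0.01 scales to `annualise(0.01)`, or roughly 15.8 per cent a year, and `mean_adjusted_variance` agrees with the usual sample variance of the returns computed with an $n - 1$ divisor.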

In addition to the historical standard deviation, several extreme value 
estimators have also been proposed. These try to improve efficiency by 
exploiting the information contained in the opening, closing, high and low 
prices during the trading day. For example, assuming a driftless and con- 
tinuous price process, Garman and Klass (1980) propose the following 
minimum-variance unbiased extreme-value estimator: 


\[ \hat{\sigma}_{GK}^2 = \frac{1}{n} \sum_{t=1}^{n} \Big( 0.511 \left(\log(H_t/L_t)\right)^2 - 0.383 \left(\log(C_t/O_t)\right)^2 - 0.019 \big( \log(C_t/O_t) \log(H_t L_t / O_t^2) - 2 \log(H_t/O_t) \log(L_t/O_t) \big) \Big) \]


where $H_t$, $L_t$, $O_t$ and $C_t$ are the highest, lowest, opening and closing price for
each trading day. It can be shown that this estimator has a sampling variance
that is almost 7.5 times smaller than that of the historical standard deviation. 
Several alternative extreme-value estimators have been developed that extend 
the information set and relax the assumptions of this estimator (see Bali and 
Weinbaum, 2005). The empirical literature has shown that, although 
extreme-value estimators generally perform well in terms of efficiency, they 
often suffer from bias when compared to the simple historical standard 
deviation approach. 
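
The Garman and Klass estimator is straightforward to compute from daily open, high, low and close prices. The sketch below is illustrative only: it returns a variance (averaged over days), and the helper `simulate_ohlc`, which generates driftless intraday paths with no overnight gap, is a hypothetical device for checking the estimator rather than part of the original method.

```python
import numpy as np

def garman_klass_variance(o, h, l, c):
    """Average daily Garman-Klass variance from open, high, low and close prices."""
    o, h, l, c = (np.asarray(x, dtype=float) for x in (o, h, l, c))
    hl = np.log(h / l)
    co = np.log(c / o)
    ho = np.log(h / o)
    lo = np.log(l / o)
    # log(H L / O^2) = log(H / O) + log(L / O)
    daily = 0.511 * hl ** 2 - 0.383 * co ** 2 - 0.019 * (co * (ho + lo) - 2.0 * ho * lo)
    return daily.mean()

def simulate_ohlc(days, steps, sigma, seed=0):
    """Driftless Gaussian intraday log-price paths; each day opens at a price of 1."""
    rng = np.random.default_rng(seed)
    incr = rng.normal(0.0, sigma / np.sqrt(steps), size=(days, steps))
    path = np.concatenate((np.zeros((days, 1)), np.cumsum(incr, axis=1)), axis=1)
    return np.ones(days), np.exp(path.max(axis=1)), np.exp(path.min(axis=1)), np.exp(path[:, -1])
```

Because a discretely sampled path understates the true high-low range, the estimate from simulated data will typically fall a little below the true variance, which is one simple source of the bias mentioned above.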

An interesting development in volatility measurement has been the 
emergence of the integrated or realised variance non-parametric estimator. 
This has become extremely popular over the past decade after a series of 
papers by Andersen, Bollerslev, Diebold and co-authors, and Barndorff- 
Nielsen and Shephard. For reviews of this growing literature, see Barndorff- 
Nielsen, Gravesen and Shephard (2004) and Andersen, Bollerslev and 
Diebold (2007). Realised variance, often termed realised volatility (RV),
measures the quadratic variation of the underlying diffusion process in
continuous time. In discrete time, it can be conveniently estimated by taking
the sum of $M$ squared returns within a fixed time interval:


\[ \hat{\sigma}_R^2 = \sum_{t=1}^{M} r_t^2 \]


Realised volatility can then be calculated as $\hat{\sigma}_R$. Assuming that the underlying
process is a semi-martingale and that arbitrage is not possible, it can be
proved that, as $M \to \infty$, RV is a uniformly consistent estimator of the
unobserved, true variability of the process and that, under certain additional 
conditions, it provides unbiased estimates. Barndorff-Nielsen and Shephard 
(2005) use log-based limit theory to show that volatility based on the RV 
estimator has the following asymptotic distribution: 


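\[ \frac{\log \hat{\sigma}_R^2 - \log \int_0^1 \sigma^2(s)\,ds}{\sqrt{\tfrac{2}{3} \sum_{t=1}^{M} r_t^4 \big/ \left(\hat{\sigma}_R^2\right)^2}} \overset{a}{\sim} N(0, 1) \]

where $\int_0^1 \sigma^2(s)\,ds$ denotes the integrated variance over the interval.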


RV measures total variation without requiring that the underlying process
is free of discontinuous jump behaviour. Empirical research on jumps has
shown that they may play a significant role and may account for a non-trivial 
part of the total variability. Jumps are particularly important in finance, since 
they play a critical role in risk management and option pricing (see 
Psychoyios, Dotsis and Markellos, 2006, and Huang and Tauchen, 2005). In 
the context of realised variance, it is possible to disentangle the effect of 
jumps on the total variation of the process. This can easily be accomplished 
in a non-parametric manner by employing the realised bipower variation 
(BPV) proposed by Barndorff-Nielsen and Shephard (2004): 


\[ BPV = \frac{\pi}{2} \sum_{t=2}^{M} |r_t|\,|r_{t-1}| \]


Simulations by Huang and Tauchen (2005) show that an empirically robust 
measure of the relative contribution of jumps to the total price variation is 
given by the relative jump statistic $RJ = (\hat{\sigma}_R^2 - BPV)/\hat{\sigma}_R^2$, or the
corresponding logarithmic ratio $J = \log(\hat{\sigma}_R^2) - \log(BPV)$.
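
Realised variance, bipower variation and the relative jump statistic can be computed directly from a vector of intraday returns. The sketch below is illustrative and assumes the usual $\pi/2$ scaling for bipower variation; the function names and the simulated returns in the usage lines are hypothetical.

```python
import numpy as np

def realised_variance(r):
    """Sum of squared intraday returns."""
    r = np.asarray(r, dtype=float)
    return np.sum(r ** 2)

def bipower_variation(r):
    """(pi / 2) times the sum of products of adjacent absolute returns."""
    a = np.abs(np.asarray(r, dtype=float))
    return (np.pi / 2.0) * np.sum(a[1:] * a[:-1])

def relative_jump(r):
    """Share of realised variance not accounted for by bipower variation."""
    rv = realised_variance(r)
    return (rv - bipower_variation(r)) / rv

# Usage: 2,000 simulated intraday returns, with and without a single jump
rng = np.random.default_rng(0)
smooth = rng.normal(0.0, 0.01 / np.sqrt(2000), 2000)
jumpy = smooth.copy()
jumpy[1000] += 0.01                     # one discontinuous move
rj_smooth, rj_jumpy = relative_jump(smooth), relative_jump(jumpy)
```

In the absence of jumps the two measures nearly coincide and RJ is close to zero, whereas a single large return raises the realised variance by roughly its square while leaving bipower variation almost unchanged.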

Realised variance allows us, in the continuous time limit, to approximate 
the ex post, instantaneous variance over any time interval, and to any desired 
degree of accuracy, by just sampling at sufficiently high frequencies. At first
glance, this is a particularly powerful estimator, since it appears to be model-
and error-free. In practice, however, only finite samples of discrete data are
available. Sampling at very high frequencies may be possible but,
unfortunately, may introduce several well-known microstructure biases related to,
for example, bid-ask bounce, screen fighting, price discreteness, irregular
spacing of quotes, illiquidity, seasonalities, etc. Empirical research has shown 
that a sampling interval of thirty minutes typically offers a good balance 
between increasing sampling frequency and reducing microstructure effects. 
RV is now widely used as a proxy for unobservable volatility to evaluate the 
performance of volatility estimators. For example, Bali and Weinbaum 
(2005) evaluate the performance of several extreme-value estimators against 
historical volatility using daily, weekly and monthly data on equities and 
exchange rates. Using RV as a proxy for unobserved volatility, they find that 
the extreme-value estimators were less biased and more efficient than the 
historical standard deviation, especially at the daily level. Moreover, the 
Garman and Klass estimator has been found to have one of the best per- 
formances amongst competing extreme-value estimators. 

Another class of estimator is based on the volatility obtained by inverting 
financial option pricing formulae using observed option prices. Let us 
assume that the price x, for an option contract can be calculated from the 
following pricing model: 


\[ x_t = f(S_t, T - t, r_t, \sigma) \]


where $S_t$ is the spot price of the asset underlying the option, $T - t$ is the life or
time to maturity of the option, $r_t$ is the risk-free rate and $\sigma$ is the volatility of
the asset. All the variables in the model, except for volatility, are observable 
and can be approximated using market data, so that a proxy for volatility can 
also be estimated by inverting f. Although these implied volatility estimators 
are widely used, especially in the financial industry, they depend on the 
pricing model chosen. Moreover, they do not always provide a single vola- 
tility estimate across various option prices for the same asset. 
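
Inverting a pricing model for volatility usually has to be done numerically. The sketch below is an illustration under stated assumptions: the pricing function $f$ is taken to be the Black-Scholes formula for a European call on a non-dividend-paying asset, a strike price `K` is introduced as an additional input, and the inversion uses simple bisection. The function names are hypothetical.

```python
import math

def _norm_cdf(x):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def bs_call(S, K, tau, r, sigma):
    """Black-Scholes price of a European call with time to maturity tau."""
    d1 = (math.log(S / K) + (r + 0.5 * sigma ** 2) * tau) / (sigma * math.sqrt(tau))
    d2 = d1 - sigma * math.sqrt(tau)
    return S * _norm_cdf(d1) - K * math.exp(-r * tau) * _norm_cdf(d2)

def implied_vol(price, S, K, tau, r, lo=1e-6, hi=5.0, tol=1e-10):
    """Solve bs_call(..., sigma) = price for sigma by bisection on [lo, hi]."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if bs_call(S, K, tau, r, mid) > price:   # the call price is increasing in sigma
            hi = mid
        else:
            lo = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)
```

Bisection is chosen here for robustness, since the call price is monotonically increasing in volatility. Applying `implied_vol` to options on the same asset with different strikes will generally return different values, which is the lack of a single volatility estimate noted above.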

A related development has been the treatment of volatility as a distinct 
asset that can be packaged in an index and traded using volatility swaps, 
futures and options. Traditionally, derivatives have allowed investors and 
firms to hedge against factors such as market, interest rate and foreign 
exchange volatility. Volatility derivatives provide protection against volatility 
risk — i.e. unexpected changes in the level of volatility itself. The first volatility 
index, the VIX (currently termed VXO), was introduced in 1993 by the 
Chicago Board Options Exchange (CBOE). Since 2003 the VIX has been
calculated as the average implied volatility of out-of-the-money option prices
across all available strikes on the S&P 500 index. Several other implied
volatility indices, in the United States and elsewhere, have since been
developed, with the financial press regularly quoting the VIX as an ‘investor
fear gauge’. A number of recent empirical studies have examined the
properties of implied volatility indices. For example, Psychoyios, Dotsis and 
Markellos (2006) analyse daily data on the VIX index over a period of ten 
years and find evidence of mean reversion, heteroskedasticity and jumps. 
Although the VIX is found to be stationary, the possibility of long memory 
cannot be excluded. 

Empirical research has shown that any measure of volatility exhibits 
persistent variations through time, and this has motivated the development 
of reduced-form forecasting models. These models are sometimes called 
autonomous, since they model volatility alone without reference to models 
for the conditional mean return. A naive, yet common, practice is to estimate 
volatility as a simple or exponential moving average of a rolling window of 
past squared returns or shocks. 
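
These naive moving-average estimators can be sketched in a few lines. The window length and the smoothing constant of 0.94 below are assumptions chosen for illustration, and the function names are hypothetical.

```python
import numpy as np

def rolling_variance(r, window):
    """Simple moving average of squared returns over a rolling window."""
    r2 = np.asarray(r, dtype=float) ** 2
    kernel = np.ones(window) / window
    return np.convolve(r2, kernel, mode="valid")   # length n - window + 1

def ewma_variance(r, lam=0.94):
    """Exponentially weighted moving average of squared returns."""
    r2 = np.asarray(r, dtype=float) ** 2
    out = np.empty(len(r2))
    out[0] = r2[0]                                  # initialise with the first squared return
    for t in range(1, len(r2)):
        out[t] = lam * out[t - 1] + (1.0 - lam) * r2[t]
    return out
```

Taking square roots of either series gives a volatility estimate. The exponential version weights recent shocks more heavily, so it reacts faster to bursts of turbulence than the equally weighted window.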

A more sophisticated approach is to model the observable volatility 
proxies using standard time series methods. For example, Taylor (1986) 
estimates ARMA models for absolute and squared returns (see also Granger 
and Ding, 1995). Andersen et al. (2003) introduce formal links between 
realised volatility and the conditional covariance matrix of returns and 
estimate various AR and ARFIMA models directly on the RV, while Engle 
and Gallo (2006) model volatility by jointly considering measures such as 
absolute daily returns, the daily high-low range and realised volatility. Since 
all these measures correspond to non-negative series, Engle and Gallo 
develop a multiplicative error model that is consistent and asymptotically 
normal under a wide range of error specifications. Finally, an alternative 
approach is to use continuous-time models to describe the dynamics of 
implied volatility indices. Psychoyios, Dotsis and Markellos (2006) estimate 
a variety of such models with data on the VIX and find that the widely used 
mean-reverting square root process can be significantly improved by the 
addition of jumps. Model performance is found to be further enhanced if 
jumps are conditioned on the level of the index. 

It should be emphasised that all the volatility proxies described above are 
calculated from a finite set of data and are obviously subject to sampling 
error. The use of noisy volatility estimates in financial models gives rise to an 
interesting econometric problem involving estimation risk, whereby the 
model is valid but the input parameters are uncertain (see, for example,
Gibson et al., 1999). For example, within option pricing, the use of estimates
in place of the true, but unknown, volatility in the Black-Scholes formula 
involves estimation risk as the estimate of the variance affects the estimate of 
the corresponding option price. An additional complication arises from the 
fact that even an unbiased estimate of volatility will not necessarily produce 
an unbiased estimate of the option price, since option pricing models 
are highly non-linear with respect to volatility (for a description of this 
literature, see Dotsis and Markellos, 2007, and the references therein). 


Example 5.1 Measuring the volatility of the DJI 

In this example we compare the volatility estimates obtained using some of 
the approaches discussed above. We use daily opening, closing, high and low 
prices for the Dow-Jones Industrial Average index (DJI) for 1996, a total of 
254 observations. We also utilise intradaily data, provided by the Olsen and 
Associates database, on the DJI. This data set corresponds to the average between 
the bid and ask price sampled at thirty-minute intervals. Since the market is 
open between 10 a.m. and 5 p.m., we have fifteen prices per trading day, or a total 
of 3810 observations. The SACF and Q-statistics indicate no autocorrelation 
in the returns. Significant autocorrelation is found in both the absolute and 
squared returns, however, suggesting that volatility may be predictable. 

Using the standard deviation of daily logarithmic returns calculated from 
closing prices we obtain a volatility estimate of 12.01 per cent. If opening prices 
are used instead, we obtain a somewhat lower volatility estimate of around 
11.7 per cent. Such differences can be caused by well-known microstructure 
effects that are present in the opening and closing of the market (see, for 
example, Stoll and Whaley, 1990). Based on the standard deviation of the 
intraday returns we obtain a volatility estimate of 11.97 per cent. The differ- 
ence between the volatility estimated using the daily and intradaily intervals 
is very small here, since the effect of microstructures is not so pronounced at 
the thirty-minute sampling frequency. The Garman and Klass (1980) extreme- 
value volatility estimator gives a much higher average volatility over the 
complete period of 21.06 per cent. We treat this estimate with some caution, as 
we know that, although the sampling error of this estimator may be small, 
extreme-value estimators can demonstrate bias. 

We have also estimated realised volatility for each day of the sample using 
the intradaily data. The time series of annualised realised volatilities is 
shown in figure 5.1. Although the unconditional distribution of returns is 
non-normal, daily returns standardised by RV estimates are almost normally 
distributed. As discussed by Barndorff-Nielsen and Shephard (2004),
non-normality of RV-standardised returns is an indication of the importance
of incorporating jumps and innovation-volatility interactions into volatility
forecasting models. Another interesting point to examine is whether, as
predicted by standard financial theory, returns and volatility have an inverse
linear relationship. As shown in figure 5.2, and in line with empirical
research, we find an asymmetry in this relationship: RV is negatively
(positively) correlated with negative (positive) returns. We have also used
data on the VIX index to assess the volatility of the DJI. Although the VIX is 
constructed from options on the S&P 500 index, we can reasonably assume 
that the volatility of this index is close to that of the DJI. Using daily data on 
the VIX for 1996 we obtain an average volatility estimate of 16.46 per cent. 

An attempt was made to model the RV and VIX using ARMA processes. 
The SACFs indicate strong serial dependencies in both series, and ARMA 
(1,1) and AR(1) models offered the best description of the RV and VIX 
series, respectively: 


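\[ RV_t = \underset{(0.006)}{0.108} + \underset{(0.080)}{0.841}\, RV_{t-1} - \underset{(0.113)}{0.651}\, \varepsilon_{t-1} + \varepsilon_t, \qquad R^2 = 11.10\% \]


and


\[ VIX_t = \underset{(0.004)}{0.167} + \underset{(0.031)}{0.868}\, VIX_{t-1} + \varepsilon_t, \qquad R^2 = 75.21\% \]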


Although both models imply positive persistence in the volatility process, the 
VIX is found to be far more predictable, with an R? of around 75 per cent. 
Although autocorrelation is effectively removed, the distribution of the 
residuals from both models is highly non-normal, mostly due to the exist- 
ence of a few large, positive errors. 

Finally, we compare the performance of the Garman and Klass and VIX 
estimators in terms of their proximity to daily realised volatility. The 
extreme-value estimator has a much stronger correlation with realised 
volatility than the VIX. The VIX has a much smaller error, however, since
$RMSE_{VIX} = 0.072$ compared to $RMSE_{GK} = 0.104$. Thus, we can conclude that
the VIX is a superior estimator of volatility for this particular sample.
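
That a proxy can be highly correlated with realised volatility and still have a large error is easily seen with a toy calculation. The numbers below are invented purely for illustration and are not the DJI data; the function names are hypothetical.

```python
import numpy as np

def rmse(estimate, target):
    """Root mean squared error of a volatility proxy against a target series."""
    e = np.asarray(estimate, dtype=float) - np.asarray(target, dtype=float)
    return np.sqrt(np.mean(e ** 2))

def correlation(estimate, target):
    return np.corrcoef(estimate, target)[0, 1]

# Hypothetical target series and two proxies
target = np.array([0.10, 0.12, 0.08, 0.15])
biased = 1.8 * target                                    # perfectly correlated, but scaled up
noisy = target + np.array([0.01, -0.01, 0.01, -0.01])    # unbiased on average, less correlated
```

Here `biased` has a correlation of one with the target but the larger RMSE, while `noisy` has the lower correlation and the smaller RMSE, mirroring the contrast between the extreme-value and VIX estimators in this example.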


Example 5.2 A jump diffusion model for the VIX implied volatility index 
Using VIX daily prices from January 1990 to September 2005 we estimate 
the parameters of the mean-reverting square root process augmented by 
upward jumps (SRJ). This continuous-time model has been used to model 
the autonomous dynamics of implied volatility indices, and assumes that 
volatility follows the process 


\[ dVIX_t = \alpha(\mu - VIX_t)\,dt + \sigma\sqrt{VIX_t}\,dB_t + y\,dq_t \]

where $dB_t$ is a standard Brownian motion, $dq_t$ is a compound Poisson
process and $y$ is the jump amplitude. In the SRJ process, $dq_t$ has constant
arrival parameter $h$ with $P(dq_t = 1) = h\,dt$, and $dB_t$ and $dq_t$ are assumed to
be independent processes. We further assume that the jump amplitude $y$
follows an exponential distribution and is restricted to be positive:
$f(y) = \eta e^{-\eta y} 1_{\{y \ge 0\}}$, where $1/\eta$ is the mean of the jump.

This exponential distribution allows us to capture upward jumps in 
implied volatility and to derive the characteristic function in closed form. 
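The mean and the variance of the process are given by


\[ E(VIX_{t+\tau}) = VIX_t e^{-\alpha\tau} + \mu\left(1 - e^{-\alpha\tau}\right) + \frac{h}{\alpha\eta}\left(1 - e^{-\alpha\tau}\right) \]

\[ V(VIX_{t+\tau}) = \frac{VIX_t\,\sigma^2}{\alpha}\left(e^{-\alpha\tau} - e^{-2\alpha\tau}\right) + \frac{\sigma^2\mu}{2\alpha}\left(1 - e^{-\alpha\tau}\right)^2 + \frac{h\sigma^2}{2\eta\alpha^2}\left(1 - e^{-\alpha\tau}\right)^2 + \frac{h}{\alpha\eta^2}\left(1 - e^{-2\alpha\tau}\right) \]


where $\tau$ is a discrete time interval. The density functions of these models can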
now be obtained via Fourier inversion of the characteristic function. 
Maximising the likelihood function, though computationally intensive, 
provides asymptotically efficient estimates of the unknown parameters (see 
Singleton, 2001). Fourier inversion, using a Gauss-Legendre quadrature, of
the characteristic function provides the necessary transition density function 
and the log-likelihood function. Using the VIX sample, the following par- 
ameters are obtained: 


dVIX, = 7.38 (01 -vo ) dt + 0. 5 VVIX,dB, + ydq: 
32 


(9.51) \ (21.76) (61 
$h = 19.41\ (4.50)$, $1/\eta = 0.017\ (8.22)$


Figures in brackets denote asymptotic t-statistics and suggest that all
parameter estimates are highly significant.

This supports the argument that the VIX exhibits significant jumps. The 
model implies an average jump frequency (h) of around twenty per year with 
an average jump magnitude (1/ 7) equal to 1.7 per cent. Implied volatility is 
strongly mean-reverting. Since the average half-life of volatility can be 
defined for this process as log(2)/a, the intuitive interpretation of the speed 
of mean reversion is that, if volatility unexpectedly jumps upwards, it will 
take 1/a = 0.094 years, or 1.13 months, to get (halfway) back to its long-run 
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mean. We can also see that the addition of jumps decreases the estimated 
long-run mean (u) in comparison with the unconditional mean of the data, 
which is 0.196. 
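
The half-life quoted above can be checked with a few lines of Python. This is only a minimal sketch: the function name `half_life_years` is introduced here for illustration, and the only input taken from the text is the estimated speed of mean reversion, 7.38.

```python
import math


def half_life_years(a: float) -> float:
    """Half-life log(2)/a of a shock under mean reversion at speed a (per year)."""
    if a <= 0:
        raise ValueError("the speed of mean reversion must be positive")
    return math.log(2.0) / a


a_hat = 7.38                          # speed of mean reversion reported in the text
hl = half_life_years(a_hat)
print(round(hl, 3), "years")          # half-life expressed in years
print(round(12.0 * hl, 2), "months")  # the same quantity expressed in months
```

Note that $1/a$ itself would be the mean lag of the adjustment rather than the half-life, which is why the logarithm of two appears in the numerator.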


5.4 Stochastic volatility 


An alternative approach to measuring volatility is to embed it within a formal 
stochastic model for the time series itself. A simple way to do this is to allow 
the variance (or conditional variance) of the process generating the time series 
to change either at certain discrete points in time or continuously. Although a 
stationary process must have a constant variance, certain conditional variances 
can change. For a non-linear stationary process $x_t$ the variance, $V(x_t)$, is a
constant for all $t$, but the conditional variance $V(x_t \mid x_{t-1}, x_{t-2}, \ldots)$ depends
on the observations and thus can change from period to period.


5.4.1 Stochastic volatility models 


Suppose that the sequence $\{x_t\}$ is generated by the product process

$$x_t = \mu + \sigma_t U_t \tag{5.3}$$

where $U_t$ is a standardised process, so that $E(U_t) = 0$ and $V(U_t) = 1$ for all $t$,
and $\sigma_t$ is a sequence of positive random variables usually such that
$V(x_t \mid \sigma_t) = \sigma_t^2$; $\sigma_t$ is thus the conditional standard deviation of $x_t$. The term
$\mu$ gives the expected return of $x_t$.

Typically $U_t = (x_t - \mu)/\sigma_t$ is assumed to be normal and independent of $\sigma_t$;
we will further assume that it is strict white noise. Equation (5.3) can then be
shown to be obtained as the discrete-time approximation to the stochastic
differential equation

$$\frac{dP}{P} = d(\log P) = \mu\,dt + \sigma\,dW$$

where $x_t = \Delta \log P_t$ and $W(t)$ is standard Brownian motion. This is the usual
diffusion process used to price financial assets in theoretical models of finance
(see, for example, Hull, 2005).

The above assumptions together imply that $x_t$ has mean $\mu$, variance

$$E(x_t - \mu)^2 = E(\sigma_t^2 U_t^2) = E(\sigma_t^2)E(U_t^2) = E(\sigma_t^2)$$

and autocovariances

$$E(x_t - \mu)(x_{t-k} - \mu) = E(\sigma_t\sigma_{t-k}U_tU_{t-k}) = E(\sigma_t\sigma_{t-k}U_t)E(U_{t-k}) = 0$$

i.e. it is white noise. Note that both the squared and absolute deviations,
$S_t = (x_t - \mu)^2$ and $M_t = |x_t - \mu|$, can be autocorrelated, however. For
example,

$$Cov(S_t, S_{t-k}) = E(S_t - E(S_t))(S_{t-k} - E(S_t)) = E(S_tS_{t-k}) - (E(S_t))^2$$

in which case we have

$$\rho_{k,S} = \frac{E(\sigma_t^2\sigma_{t-k}^2) - (E(\sigma_t^2))^2}{E(\sigma_t^4) - (E(\sigma_t^2))^2}$$

where $\rho_{k,S}$ is the $k$th autocorrelation of $S_t$.


What models are plausible for the conditional standard deviation $\sigma_t$? Since it
is a sequence of positive random variables a normal distribution is inappropriate,
but, as it is likely that $\sigma_t$ will be skewed to the right, a log-normal
distribution would seem to be a plausible choice. We can define the basic
log-normal autoregressive stochastic volatility (ARSV) model of order one as

$$h_t = \log(\sigma_t^2) = \gamma_0 + \gamma_1 h_{t-1} + \eta_t \tag{5.4}$$

where $\eta_t \sim NID(0, \sigma_\eta^2)$ and is independent of $U_t$, i.e. $E(\eta_t U_t) = 0$. A common
interpretation of $h_t$ is that it represents the random and uneven flow of
new information into financial markets: see Clark (1973) and Tauchen and Pitts
(1983). The parameter $\gamma_1$ captures the persistence in volatility; when it
approaches one and $\sigma_\eta^2$ is close to zero, volatility evolves smoothly. Returns are
homoskedastic in the limit when $\gamma_1 = 1$ and $\sigma_\eta^2 = 0$. We then have

$$x_t = \mu + U_t\exp(h_t/2)$$


It can easily be shown that $x_t$ is a martingale difference process. Also, since
$U_t$ is always stationary, $x_t$ will be (weakly) stationary if and only if $h_t$ is, which
will be the case if $|\gamma_1| < 1$. Assuming this, then using the properties of the
log-normal distribution shows that all even moments of $x_t$ and $S_t$ will exist,
being given by

$$E(x_t - \mu)^r = E(S_t^{r/2}) = E(U_t^r)E\left(\exp\left(\frac{r}{2}h_t\right)\right) = \frac{r!}{2^{r/2}(r/2)!}\exp\left(\frac{r}{2}\mu_h + \left(\frac{r}{2}\right)^2\frac{\sigma_h^2}{2}\right)$$

where $\mu_h = E(h_t) = \gamma_0/(1 - \gamma_1)$ and $\sigma_h^2 = V(h_t) = \sigma_\eta^2/(1 - \gamma_1^2)$. All odd
moments are zero. The moment measure of kurtosis is then given by

$$\frac{E(S_t^2)}{(E(S_t))^2} = \frac{E(x_t - \mu)^4}{(E(x_t - \mu)^2)^2} = 3\exp(\sigma_h^2) > 3$$


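so that the process has fatter tails than a normal distribution. The
autocorrelation function of $S_t$ follows from the fact that

$$\begin{aligned}
E(S_tS_{t-k}) &= E(\sigma_t^2\sigma_{t-k}^2) = E(\exp(h_t)\exp(h_{t-k})) = E(\exp(h_t + h_{t-k})) \\
&= \exp\left((\mu_h + \sigma_h^2) + (\mu_h + \gamma_1^k\sigma_h^2)\right) = \exp\left(2\mu_h + \sigma_h^2(1 + \gamma_1^k)\right)
\end{aligned}$$

Hence

$$\begin{aligned}
Cov(S_t, S_{t-k}) &= \exp\left(2\mu_h + \sigma_h^2(1 + \gamma_1^k)\right) - \exp(2\mu_h + \sigma_h^2) \\
&= \exp(2\mu_h + \sigma_h^2)\left(\exp(\sigma_h^2\gamma_1^k) - 1\right)
\end{aligned}$$

and

$$\rho_{k,S} = \frac{\exp(\sigma_h^2\gamma_1^k) - 1}{3\exp(\sigma_h^2) - 1}$$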
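
The moment results for the ARSV(1) model can be evaluated numerically. The sketch below assumes normal $U_t$ and a stationary $h_t$; the function name `arsv_moments` and the parameter values in the usage lines are hypothetical and chosen only for illustration.

```python
import math


def arsv_moments(gamma0, gamma1, sigma_eta2, max_lag=5):
    """Theoretical kurtosis of x_t and autocorrelations of S_t for an ARSV(1)."""
    if abs(gamma1) >= 1.0:
        raise ValueError("stationarity requires |gamma1| < 1")
    mu_h = gamma0 / (1.0 - gamma1)                # mean of h_t
    sigma_h2 = sigma_eta2 / (1.0 - gamma1 ** 2)   # variance of h_t
    kurtosis = 3.0 * math.exp(sigma_h2)           # exceeds 3 whenever sigma_h2 > 0
    denom = 3.0 * math.exp(sigma_h2) - 1.0
    acf = [(math.exp(sigma_h2 * gamma1 ** k) - 1.0) / denom
           for k in range(1, max_lag + 1)]
    return mu_h, sigma_h2, kurtosis, acf


# Hypothetical parameter values, not estimates from the text.
mu_h, sigma_h2, kurtosis, acf = arsv_moments(-0.2, 0.95, 0.05)
print(kurtosis)
print(acf)
```

With positive $\gamma_1$ the autocorrelations of $S_t$ are positive, decline with the lag and are bounded above by one-third, however persistent $h_t$ is.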


Taking logarithms of (5.3) yields

$$\log(S_t) = h_t + \log(U_t^2) = \mu_h + \frac{\eta_t}{(1 - \gamma_1B)} + \log(U_t^2)$$

which shows that $\log(S_t) \sim ARMA(1,1)$, but with non-normal innovations: if
$U_t$ is normal then $\log(U_t^2)$ has mean $-1.27$ and variance 4.93 and a very long
left-hand tail, caused by taking logarithms of very small numbers. The
autocorrelation function of $\log(S_t)$ is

$$\rho_{k,\log(S)} = \frac{\gamma_1^k}{(1 + 4.93/\sigma_h^2)}$$
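
The two constants used above follow from the moments of the logarithm of a chi-squared variable with one degree of freedom: the mean is $-\gamma_E - \log 2$, where $\gamma_E$ is Euler's constant, and the variance is $\pi^2/2$. The short sketch below reproduces them; the function name `acf_log_s` is introduced here for illustration only.

```python
import math

EULER_GAMMA = 0.5772156649015329              # Euler's constant

# Moments of log(U_t^2) when U_t is standard normal.
mean_log_u2 = -EULER_GAMMA - math.log(2.0)    # digamma(1/2) + log(2)
var_log_u2 = math.pi ** 2 / 2.0               # trigamma(1/2)


def acf_log_s(k, gamma1, sigma_h2):
    """Lag-k autocorrelation of log(S_t) in the ARSV(1) model."""
    return gamma1 ** k / (1.0 + var_log_u2 / sigma_h2)


print(round(mean_log_u2, 2), round(var_log_u2, 2))
```

Because the variance of $\log(U_t^2)$ is large relative to typical values of $\sigma_h^2$, the autocorrelations of $\log(S_t)$ are small even when $\gamma_1$ is close to one.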


Note that it is possible that some values of $S_t$ may be zero, in which case
their logarithms cannot be taken. One way of overcoming this difficulty
is to employ the modified log-squared transformation (see Broto and
Ruiz, 2004)

$$S_t^* = \log(S_t + cs_S^2) - cs_S^2/(S_t + cs_S^2)$$

where $s_S^2$ is the sample variance of $S_t$ and $c$ is a small number, often set to be 0.02.
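
A minimal sketch of the modified log-squared transformation follows. It takes the sample variance over the series passed in, as the text describes, and uses $c = 0.02$ by default; the function name `modified_log_squared` and the toy data are assumptions made for illustration.

```python
import math
from statistics import variance


def modified_log_squared(s, c=0.02):
    """Apply log(S_t + c*s2) - c*s2/(S_t + c*s2), with s2 the sample variance."""
    s2 = variance(s)              # sample variance of the supplied series
    offset = c * s2
    if offset <= 0.0:
        raise ValueError("the series must have positive sample variance")
    return [math.log(v + offset) - offset / (v + offset) for v in s]


# Toy squared deviations, including an exact zero that a plain log could not handle.
s = [0.0, 1.0, 4.0, 0.25]
s_star = modified_log_squared(s)
print(s_star)
```

For observations that are large relative to the offset the transformation is close to the ordinary logarithm, so only the values near zero are materially affected.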

Several extensions have been made to the basic ARSV model described 
above (for a review, see Ghysels, Harvey and Renault, 1996). A variety of 
heavy-tailed distributions have been proposed for the error process $\eta_t$, and
when the errors follow a t-distribution the model can be interpreted as 
having two independent volatility processes (see Liesenfeld and Richard, 
2003). Harvey and Shephard (1996) and Jacquier, Polson and Rossi (2004) 
have developed two different models that allow for correlation between the 
shocks in the mean and variance processes. These models are able to capture 
the ‘leverage’ effect, in which negative (positive) shocks to returns are asso- 
ciated with increases (decreases) in volatility. This effect stems from the 
observation that bad news about a firm, which decreases the price of the 
firm’s stock and hence increases the debt-to-equity ratio (i.e. its financial 
leverage), makes the firm riskier and tends to increase future expected 
volatility. In the Harvey and Shephard (1996) model the errors in the mean
and variance equations have correlation $\rho$, i.e. $E(\eta_tU_t) = \rho\sigma_\eta$. The ARSV with
leverage can then be written

$$h_t = \gamma_0 + \gamma_1h_{t-1} + \rho\sigma_\eta x_{t-1}\exp(-0.5h_{t-1}) + \eta_t^*$$

where $\eta_t^*$ is distributed as $N(0, \sigma_\eta^2(1 - \rho^2))$ with $E(U_t\eta_t^*) = 0$.

Yu (2005) shows that the specification proposed by Harvey and Shephard 
(1996) is superior to the Jacquier, Polson and Rossi (2004) model in terms of 
both its interpretability of the leverage effect and its empirical validity. 
Moreover, it has the advantage of retaining the martingale difference 
property for the underlying process. Ding, Granger and Engle (1993), among 
others, have argued that volatility may be better modelled using a persistent, 
long-memory process. This is motivated by empirical findings demonstrat- 
ing that the autocorrelations of squared returns decay at a much slower rate 
than the expected exponential decline. As with the aggregation of ARMA pro- 
cesses (Granger, 1980), Zaffaroni (2007) shows that long memory in variance 
can result from aggregating certain stochastic volatility processes. Breidt, 
Crato and de Lima (1998) and Harvey (1998) have developed long-memory 
specifications of stochastic volatility models where the logarithmic variance 
follows an ARFIMA($p$, $d$, $q$) process. Liu (2000) proposes a regime-switching
ARSV process that also exhibits long memory. Finally, various multivariate
extensions of stochastic volatility models have also been developed (see,
for example, Harvey, Ruiz and Shephard, 1994, and Chib, Nardari and
Shephard, 2006).

Stochastic volatility models have the significant advantage of having 
representations in both continuous and discrete time. This is very important 
for option pricing and theoretical finance, since most of the models in this 
literature are expressed in continuous time. Using the results of Meddahi and 
Renault (2004), a general class of autoregressive stochastic volatility models 
that are closed under temporal aggregation may be defined that allow for a 
precise relation to be derived between continuous- and discrete-time para- 
meterisations. For example, it can easily be shown that the ARSV(1) is a 
discrete-time approximation to the continuous-time Ornstein—Uhlenbeck 
diffusion process that is widely used in the option pricing and interest rate 
literature (see Davidson, 2006b). 

One of the most popular of these models is the one-factor, square root 
volatility model, or scalar affine diffusion, of Heston (1993). The equations 
for the conditional mean and conditional variance are 


and 
$$dV_t = (\alpha - \beta V_t)\,dt + \sigma_V\sqrt{V_t}\,dB_{2t}$$

Here $p_t$ is the logarithm of the asset price and $V_t$ is the instantaneous (latent)
stochastic volatility, which is assumed to follow a square root process. The
$dB_{it}$, $i = 1, 2$, are Brownian motions with instantaneous correlation $\rho\,dt$,
implying that the stochastic volatility premium is linear. When the correlation
is unity, we obtain a single-factor model such as the autonomous
volatility model discussed in the previous section. The parameters $\sigma_V$, $\beta$ and
$\rho$ are of great importance for option pricing, since they express the manner
by which $p_t$ deviates from the standard log-normal assumption. Heston
(1993) uses Fourier inversion to derive convenient analytical formulae for
option pricing under stochastic volatility, and these have become very
popular. The parameter $\alpha$ determines the unconditional mean of volatility
(which is $\alpha/\beta$), $\beta$ is the speed of mean reversion and $\sigma_V$ is the volatility
of volatility parameter.
The presence of kurtosis and fat tails is dependent upon the size of $\sigma_V$ relative
to $\beta$. For the process to be well defined, the following inequalities must hold:
$\alpha, \beta > 0$ and $\sigma_V^2 \le 2\alpha$. Empirical studies typically find a negative correlation
$\rho$, which is sometimes referred to as a ‘continuous-time’ leverage effect.
Although the Heston model has not been found to offer a realistic
representation of index returns, primarily due to insufficient kurtosis, the 
addition of a jump component improves performance considerably (see, for 
example, Eraker, Johannes and Polson, 2003). Eraker (2004) proposes a 
model with discontinuous correlated jumps in stock prices and volatility 
with a state-dependent arrival intensity. A simple generalisation of the 
Heston model is the constant elasticity of variance (CEV) model, which can 
be derived by replacing the square root in the variance diffusion term by 
an exponent of undetermined magnitude (see, for example, Jones, 2003). 
Bollerslev and Zhou (2006) have recently used the Heston model to study 
various volatility puzzles concerning the leverage effect and the difference 
between realised and option-market-implied volatilities. They also try to 
explain the empirical controversies with respect to the relationships between 
contemporaneous returns and realised volatility, and returns and implied 
volatility, respectively. 
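
The square root variance process can be explored with a simple simulation. The sketch below assumes that the variance equation has drift $(\alpha - \beta V_t)$ and diffusion $\sigma_V\sqrt{V_t}$, checks the parameter inequalities quoted above, and uses an Euler scheme with truncation at zero, which is only one of several possible discretisations. The function names and parameter values are hypothetical.

```python
import math
import random


def heston_variance_check(alpha, beta, sigma_v):
    """Check alpha, beta > 0 and sigma_v^2 <= 2*alpha; return (ok, alpha/beta)."""
    ok = alpha > 0 and beta > 0 and sigma_v ** 2 <= 2.0 * alpha
    long_run_mean = alpha / beta if beta > 0 else float("nan")
    return ok, long_run_mean


def simulate_variance(alpha, beta, sigma_v, v0, n_steps, dt, seed=0):
    """Euler discretisation of dV = (alpha - beta*V)dt + sigma_v*sqrt(V)dB."""
    rng = random.Random(seed)
    v, path = v0, [v0]
    for _ in range(n_steps):
        v_pos = max(v, 0.0)   # truncate at zero inside the drift and diffusion
        v = v + (alpha - beta * v_pos) * dt \
            + sigma_v * math.sqrt(v_pos * dt) * rng.gauss(0.0, 1.0)
        path.append(v)
    return path


# Hypothetical annualised parameters, not estimates from the text.
ok, long_run_mean = heston_variance_check(alpha=0.04, beta=2.0, sigma_v=0.2)
path = simulate_variance(0.04, 2.0, 0.2, v0=0.02, n_steps=252, dt=1.0 / 252)
print(ok, long_run_mean, len(path))
```

When the inequality on $\sigma_V^2$ fails the simulated variance can reach zero, which is where the truncation inside the square root becomes necessary.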


5.4.2 Estimation of stochastic volatility models 


Until a few years ago, stochastic volatility models were rarely used in 
empirical applications because they were particularly difficult to estimate. 
Although their statistical properties are easy to derive using
established results on log-normal distributions, the likelihood function and the
transition densities are rarely available in closed form.
Moreover, volatility is a latent factor that cannot be observed directly from 
historical data, and therefore it must either be approximated, typically by 
inverting an option pricing formula, or it must be ‘backed out’ via inte- 
gration from the stochastic volatility model. 

Fortunately, powerful estimation schemes have been proposed in recent 
years (for reviews, see Shephard, 1996, and Broto and Ruiz, 2004), and these 
have led to the development of several empirical methods, including simu- 
lated methods of moments, efficient methods of moments, analytic 
approximations to the likelihood function and spectral methods. Much 
interest has centred on the Markov chain Monte Carlo (MCMC) approach 
(see the review by Johannes and Polson, 2007). This is a computationally 
intensive technique that is well suited to continuous-time stochastic vola- 
tility models, as it directly computes the distribution of the latent variables 
and parameters given the observed data. Due to the Bayesian nature of 
the estimation it can also quantify estimation and model risk. Although all 
these approaches can deliver consistent and often asymptotically efficient
estimates, they are still computationally demanding and rather difficult to
implement.

A special case of the MCMC algorithm that is based on Gibbs sampling 
can be applied using the freely available software BUGS (for a description, 
see Meyer and Yu, 2000). A convenient and popular estimation method is 
still, however, quasi-maximum likelihood (QML), as outlined in Koopman 
et al. (2006, chap. 7.5) and available in their STAMP software package. This 
technique, which uses the Kalman filter, also provides an estimate of the
volatility $\sigma_t^2$. Although it has been shown that this estimator is consistent and
asymptotically normal, it is inefficient since it does not rely on the exact 
likelihood function. Another practical approach has been proposed by 
Bollerslev and Zhou (2002), who exploit the distributional information con- 
tained in realised volatility to construct a simple conditional moment esti- 
mator for stochastic volatility diffusions using GMM. Although this approach 
has the shortcoming of requiring high-frequency data, Bollerslev and Zhou 
demonstrate using simulation that it provides highly reliable and accurate 
estimators in finite samples. 


Example 5.3 A stochastic volatility model for the dollar/sterling exchange rate 
In this example we fit the SV model

$$x_t = \mu + U_t\exp(h_t/2)$$

$$h_t = \log(\sigma_t^2) = \gamma_0 + \gamma_1h_{t-1} + \eta_t$$

to the daily series of dollar/sterling first differences initially examined in
example 2.5, where it was found to be close to zero-mean white noise. To use
the QML technique of Koopman et al. (2006), the model is rewritten as

$$x_t = \sigma U_t\exp(h_t/2)$$

$$h_t = \gamma_1h_{t-1} + \eta_t$$

where $\sigma = \exp(\gamma_0/2)$, or as

$$\log(x_t^2) = \kappa + h_t + u_t$$

$$h_t = \gamma_1h_{t-1} + \eta_t$$

where

$$u_t = \log(U_t^2) - E(\log(U_t^2))$$

and

$$\kappa = \log(\sigma^2) + E(\log(U_t^2))$$

QML estimation yields the following estimates: $\hat\sigma_\eta = 0.050$, $\hat\sigma = 1.906$,
$\hat\gamma_0 = 1.312$ and $\hat\gamma_1 = 0.995$, and a plot of the exchange rate volatility, given by the
‘smoothed’ estimates (of the square root) of $\exp(h_t/2)$, is shown in figure 5.3.
The conditional variance equation is close to a random walk and the time-varying
nature of the volatility can clearly be seen.

Figure 5.3 Dollar/sterling exchange rate ‘volatility’ (daily January 1993 to December 2005)
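
The quasi-likelihood underlying the QML approach can be sketched with a scalar Kalman filter applied to the linear state space form for $\log(x_t^2)$. The code below is an illustration under stated assumptions and is not the STAMP implementation: it treats $u_t$ as if it were normal with variance $\pi^2/2$, starts the filter from the stationary distribution of $h_t$ (so it requires $|\gamma_1| < 1$), and the function name `qml_loglik` is hypothetical.

```python
import math

VAR_U = math.pi ** 2 / 2.0   # variance of log(U_t^2) for standard normal U_t


def qml_loglik(y, kappa, gamma1, sigma_eta2):
    """Gaussian quasi-log-likelihood of y_t = kappa + h_t + u_t,
    h_t = gamma1*h_{t-1} + eta_t, together with the filtered estimates of h_t."""
    if abs(gamma1) >= 1.0 or sigma_eta2 <= 0.0:
        raise ValueError("need |gamma1| < 1 and sigma_eta2 > 0")
    a = 0.0                                  # predicted state mean
    p = sigma_eta2 / (1.0 - gamma1 ** 2)     # predicted state variance
    loglik, filtered = 0.0, []
    for obs in y:
        v = obs - kappa - a                  # one-step prediction error
        f = p + VAR_U                        # prediction error variance
        loglik += -0.5 * (math.log(2.0 * math.pi) + math.log(f) + v * v / f)
        gain = p / f                         # Kalman gain
        a_filt = a + gain * v                # updated state mean
        p_filt = p * (1.0 - gain)            # updated state variance
        filtered.append(a_filt)
        a = gamma1 * a_filt                  # predict the next state
        p = gamma1 ** 2 * p_filt + sigma_eta2
    return loglik, filtered


# y would hold log(x_t^2); a single made-up observation is used here.
ll, h_filtered = qml_loglik([-1.0], kappa=-1.27, gamma1=0.9, sigma_eta2=0.1)
print(ll, h_filtered)
```

In practice the function would be passed to a numerical optimiser over $(\kappa, \gamma_1, \sigma_\eta^2)$, and a smoothing pass would be added to obtain the smoothed volatility estimates of the kind plotted in the example.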


Example 5.4 A leveraged logarithmic ARSV(1) model for the DJI 


In this example we estimate the leveraged stochastic volatility model proposed 
by Harvey and Shephard (1996). The model is estimated by the MCMC 
methodology, using the all-purpose Bayesian software package BUGS. This 
software allows an easy and efficient implementation of the Gibbs sampler, a 
specific MCMC technique that constructs a Markov chain by sampling from 
all univariate full-conditional distributions in a cyclical way. The BUGS code 
necessary for estimating the model accompanies the paper by Yu (2005). 
The model can be conveniently represented as

$$h_t \mid h_{t-1}, \gamma_0, \gamma_1, \sigma_\eta^2 \sim N(\gamma_0 + \gamma_1h_{t-1}, \sigma_\eta^2)$$

$$x_t \mid h_{t+1}, h_t, \gamma_0, \gamma_1, \sigma_\eta^2, \rho \sim N\left((\rho/\sigma_\eta)\exp(h_t/2)(h_{t+1} - \gamma_0 - \gamma_1h_t), \exp(h_t)(1 - \rho^2)\right)$$

Table 5.1 Empirical estimates of the leveraged ARSV(1) model for the DJI

                   Average    Standard deviation    95% credible intervals
$\omega$           -7.912     0.337                 (-8.557, -7.241)
$\gamma_1$          0.998     0.001                 (0.996, 1.000)
$\rho$             -0.758     0.090                 (-0.911, -0.561)
$\sigma_\eta$       0.102     0.015                 (0.078, 0.136)

The prior distributions are assumed to be independent using the specifications
adopted by Yu (2005), following Kim, Nelson and Startz (1998). More
specifically, $\sigma_\eta^2$ is distributed as inverse gamma with parameters 2.5 and
0.025, so that it has a mean of 0.0167 and a standard deviation of 0.024, and
$(\gamma_1 + 1)/2$ is beta distributed with parameters 20 and 1.5, so that it has a
mean of 0.93 and a standard deviation of 0.055. We also assume that
$\omega = \gamma_0/(1 - \gamma_1) \sim N(0, 25)$ and that $\rho$ is uniformly distributed with support between
$-1$ and 1. We perform 20,000 iterations and discard the first 10,000. The
algorithm is initialised by setting $\omega = 0$, $\gamma_1 = 0.98$, $\sigma_\eta^2 = 0.025$ and $\rho = -0.4$.
We use daily data for the DJI between 20 September 2002 and 8 September
2006, a total of 1000 observations.

The posterior means for the parameters of the estimated model are 
reported in table 5.1, which also gives the standard deviations and the 
95 per cent Bayes credible intervals of the posterior distributions. It is evident 
that all parameter estimates are significant, while the significantly negative 
value found for $\rho$ confirms the existence of a leverage effect.
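
The moments implied by the inverse gamma and beta priors can be checked from the standard formulae. The sketch below assumes the usual shape/scale parametrisation of the inverse gamma distribution, which the text does not state explicitly; the function names are introduced here for illustration.

```python
import math


def inverse_gamma_moments(shape, scale):
    """Mean and standard deviation of an inverse gamma (requires shape > 2)."""
    if shape <= 2.0:
        raise ValueError("the variance exists only for shape > 2")
    mean = scale / (shape - 1.0)
    var = scale ** 2 / ((shape - 1.0) ** 2 * (shape - 2.0))
    return mean, math.sqrt(var)


def beta_moments(a, b):
    """Mean and standard deviation of a beta(a, b) distribution."""
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1.0))
    return mean, math.sqrt(var)


ig_mean, ig_sd = inverse_gamma_moments(2.5, 0.025)   # prior for sigma_eta^2
b_mean, b_sd = beta_moments(20.0, 1.5)               # prior for (gamma_1 + 1)/2
print(ig_mean, ig_sd, b_mean, b_sd)
```

Since the beta prior is placed on $(\gamma_1 + 1)/2$, a prior mean of about 0.93 for that quantity corresponds to a prior mean of about 0.86 for $\gamma_1$ itself.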


5.5 ARCH processes 


5.5.1 Development of generalised ARCH processes 


In the previous section, the process determining the conditional standard
deviations of $x_t$ was assumed not to be a function of $x_t$. For example, for
the AR(1) log-normal model of equation (5.4), $\sigma_t$ was dependent upon the
information set $\{\eta_t, \sigma_{t-1}, \sigma_{t-2}, \ldots\}$. We now consider the case where the
conditional standard deviations are a function of past values of $x_t$, i.e.

$$\sigma_t = \mathcal{F}(x_{t-1}, x_{t-2}, \ldots)$$

A simple example is

$$\sigma_t = \mathcal{F}(x_{t-1}) = \left(\alpha_0 + \alpha_1(x_{t-1} - \mu)^2\right)^{1/2} \tag{5.5}$$

where $\alpha_0$ and $\alpha_1$ are both positive. With $U_t \sim NID(0,1)$ and independent of
$\sigma_t$, $x_t = \mu + \sigma_tU_t$ is then white noise and conditionally normal, i.e.

$$x_t \mid x_{t-1}, x_{t-2}, \ldots \sim NID(\mu, \sigma_t^2)$$

so that

$$V(x_t \mid x_{t-1}) = \alpha_0 + \alpha_1(x_{t-1} - \mu)^2$$

If $\alpha_1 < 1$ the unconditional variance is $V(x_t) = \alpha_0/(1 - \alpha_1)$ and $x_t$ is weakly
stationary. The fourth moment of $x_t$ is finite if $3\alpha_1^2 < 1$ and, if so, the kurtosis
is given by $3(1 - \alpha_1^2)/(1 - 3\alpha_1^2)$. This exceeds three, so that the unconditional
distribution of $x_t$ is fatter-tailed than the normal. If this moment
condition is not satisfied, then the variance of $x_t^2$ will not be finite and hence
$x_t^2$ will not be weakly stationary.
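
The moment conditions for the ARCH(1) process translate directly into code. The following sketch assumes conditional normality, as in the text; the function names and the numerical inputs are illustrative only.

```python
import math


def arch1_unconditional_variance(alpha0, alpha1):
    """alpha0/(1 - alpha1), defined when alpha0 > 0 and 0 <= alpha1 < 1."""
    if not (alpha0 > 0.0 and 0.0 <= alpha1 < 1.0):
        raise ValueError("need alpha0 > 0 and 0 <= alpha1 < 1")
    return alpha0 / (1.0 - alpha1)


def arch1_kurtosis(alpha1):
    """3(1 - alpha1^2)/(1 - 3*alpha1^2), or infinity if 3*alpha1^2 >= 1."""
    if 3.0 * alpha1 ** 2 >= 1.0:
        return math.inf
    return 3.0 * (1.0 - alpha1 ** 2) / (1.0 - 3.0 * alpha1 ** 2)


print(arch1_unconditional_variance(0.2, 0.5))
print(arch1_kurtosis(0.5))
print(arch1_kurtosis(0.6))
```

The fourth-moment condition $3\alpha_1^2 < 1$ restricts $\alpha_1$ to be below roughly 0.577, which is considerably tighter than the condition $\alpha_1 < 1$ needed for a finite variance.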

This model was first introduced by Engle (1982) and is known as the first- 
order autoregressive conditional heteroskedastic, or ARCH(1), process. ARCH 
processes have proved to be an extremely popular class of non-linear models 
for financial time series, as can be seen from the various and many surveys 
of the literature that have been published: Engle and Bollerslev (1986), 
Bollerslev, Chou and Kroner (1992), Bera and Higgins (1993), Bollerslev, 
Engle and Nelson (1994), Engle (2002), Li, Ling and McAleer (2002), 
Giraitis, Leipus and Surgailis (2006) and Teräsvirta (2007) is a by no means 
exhaustive list. Instructive expositions of ARCH modelling from a practical 
perspective can be found in Engle (2001) and Engle and Patton (2001). 

A more convenient notation is to define $\varepsilon_t = x_t - \mu = U_t\sigma_t$, so that the
ARCH(1) model can be written as

$$\varepsilon_t \mid x_{t-1}, x_{t-2}, \ldots \sim NID(0, \sigma_t^2)$$

$$\sigma_t^2 = \alpha_0 + \alpha_1\varepsilon_{t-1}^2$$

Defining $\nu_t = \varepsilon_t^2 - \sigma_t^2$, the model can also be written as

$$\varepsilon_t^2 = \alpha_0 + \alpha_1\varepsilon_{t-1}^2 + \nu_t$$

Since $E(\nu_t \mid x_{t-1}, x_{t-2}, \ldots) = 0$, the model corresponds directly to an AR(1)
model for the squared innovations $\varepsilon_t^2$. As $\nu_t = \sigma_t^2(U_t^2 - 1)$, however, the
errors are obviously heteroskedastic.

The ARCH(1) model can be interpreted in various ways. For example,
suppose the parameters in the ARCH equation are defined as $\alpha_0 = V(u_t)$ and
$\alpha_1 = V(\phi_t)$, where $u_t$ and $\phi_t$ are independent. Thus,

$$V(\varepsilon_t) = V(u_t) + V(\phi_t)\varepsilon_{t-1}^2$$

which is consistent with $\varepsilon_t$ being generated by a random coefficient AR(1)
process

$$\varepsilon_t = \phi_t\varepsilon_{t-1} + u_t = \phi_t\sigma_{t-1}U_{t-1} + u_t$$

where $E(\phi_t) = \phi$ and $u_t$ has mean zero (see Bera and Higgins, 1993, for more
on this interpretation).
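A natural extension is the ARCH($q$) process, where (5.5) is replaced by

$$\sigma_t = \mathcal{F}(x_{t-1}, x_{t-2}, \ldots, x_{t-q}) = \left(\alpha_0 + \sum_{i=1}^{q}\alpha_i(x_{t-i} - \mu)^2\right)^{1/2}$$

where $\alpha_0 > 0$ and $\alpha_i \ge 0$, $1 \le i \le q$. The process will be weakly stationary if all
the roots of the characteristic equation associated with the ARCH parameters,
$\alpha(B)$, lie outside the unit circle, i.e. if $\sum_{i=1}^{q}\alpha_i < 1$, in which case the
unconditional variance is $V(x_t) = \alpha_0/(1 - \sum_{i=1}^{q}\alpha_i)$. In terms of $\varepsilon_t$ and $\sigma_t^2$,
the conditional variance function is

$$\sigma_t^2 = \alpha_0 + \sum_{i=1}^{q}\alpha_i\varepsilon_{t-i}^2$$

or, equivalently,

$$\varepsilon_t^2 = \alpha_0 + \alpha(B)\varepsilon_{t-1}^2 + \nu_t$$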


Detailed discussion of the ARCH(q) model, setting out further technical 
conditions that need not concern us here, may be found in, for example, 
Engle (1982), Milhøj (1985) and Weiss (1986a). 

A practical difficulty with ARCH models is that, with $q$ large, unconstrained
estimation (to be discussed later) will often lead to the violation of
the non-negativity constraints on the $\alpha_i$s that are needed to ensure that the
conditional variance $\sigma_t^2$ is always positive. In many early applications of the
model a rather arbitrary declining lag structure was thus imposed on the $\alpha_i$s
to ensure that these constraints were met. To obtain more flexibility, a
further extension, to the generalised ARCH (GARCH) process, was proposed
(Bollerslev, 1986, 1988); the GARCH($p$, $q$) process has the conditional
variance function

$$\begin{aligned}
\sigma_t^2 &= \alpha_0 + \sum_{i=1}^{q}\alpha_i\varepsilon_{t-i}^2 + \sum_{i=1}^{p}\beta_i\sigma_{t-i}^2 \\
&= \alpha_0 + \alpha(B)\varepsilon_{t-1}^2 + \beta(B)\sigma_{t-1}^2
\end{aligned}$$

where $p > 0$ and $\beta_i \ge 0$, $1 \le i \le p$.

For the conditional variance of the GARCH($p$, $q$) model to be well defined,
all the coefficients in the corresponding ARCH($\infty$) model $\sigma_t^2 = \theta_0 + \theta(B)\varepsilon_t^2$
must be positive. Provided that $\alpha(B)$ and $\beta(B)$ have no common roots and
that the roots of $\beta(B)$ lie outside the unit circle, this positivity constraint is
satisfied if and only if all the coefficients in $\theta(B) = \alpha(B)/(1 - \beta(B))$ are
non-negative. Necessary and sufficient conditions for this are given in Nelson and
Cao (1992). For the GARCH(1,1) process,

$$\sigma_t^2 = \alpha_0 + \alpha_1\varepsilon_{t-1}^2 + \beta_1\sigma_{t-1}^2$$

a model that has proved extremely popular for modelling financial time
series, these conditions require that all three parameters are non-negative.
The equivalent form of the GARCH($p$, $q$) process is

$$\varepsilon_t^2 = \alpha_0 + (\alpha(B) + \beta(B))\varepsilon_{t-1}^2 + \nu_t - \beta(B)\nu_{t-1} \tag{5.6}$$

so that $\varepsilon_t^2 \sim ARMA(m, p)$, where $m = \max(p, q)$. This process will be weakly
stationary if and only if the roots of $\alpha(B) + \beta(B)$ lie outside the unit circle,
i.e. if $\alpha(1) + \beta(1) < 1$. This also ensures that $\varepsilon_t$ is weakly stationary, but it is
only a sufficient, rather than a necessary, condition for strict stationarity.
Because ARCH processes are thick-tailed, the conditions for weak stationarity
are often more stringent than those for strict stationarity. For example, Nelson
(1990a) shows that $\varepsilon_t$ and $\sigma_t^2$ will be strictly stationary in the GARCH(1,1)
model if and only if

$$E(\log(\beta_1 + \alpha_1U_t^2)) < 0$$

and this will be satisfied if, for example, $U_t \sim N(0,1)$, $\alpha_1 = 3$ and $\beta_1 = 0$,
although the conditions for weak stationarity are clearly violated. Stationarity
conditions for the general GARCH($p$, $q$) process are derived in Bougerol and
Picard (1992).
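
For the special case $\beta_1 = 0$ with normal $U_t$, the strict stationarity condition has a closed form, because $E(\log(U_t^2)) = -\gamma_E - \log 2$, where $\gamma_E$ is Euler's constant. The sketch below uses this to confirm that $\alpha_1 = 3$ gives a negative log-moment, so the process is strictly but not weakly stationary. The function names are hypothetical.

```python
import math

EULER_GAMMA = 0.5772156649015329   # Euler's constant


def arch1_log_moment(alpha1):
    """E[log(alpha1 * U^2)] for U ~ N(0,1): log(alpha1) - gamma_E - log(2)."""
    return math.log(alpha1) - EULER_GAMMA - math.log(2.0)


def arch1_strictly_stationary(alpha1):
    """Strict stationarity holds when the log-moment is negative."""
    return arch1_log_moment(alpha1) < 0.0


def arch1_weakly_stationary(alpha1):
    """Weak stationarity requires alpha1 < 1."""
    return alpha1 < 1.0


print(arch1_log_moment(3.0))
print(arch1_strictly_stationary(3.0), arch1_weakly_stationary(3.0))
print(2.0 * math.exp(EULER_GAMMA))   # largest alpha1 compatible with strict stationarity
```

The last line shows that, under these assumptions, strict stationarity of the ARCH(1) process holds for $\alpha_1$ up to about 3.56, far beyond the bound of one required for a finite variance.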

These complications with stationarity conditions carry over to the concept
of ‘volatility persistence’ in GARCH models. If $\alpha(1) + \beta(1) = 1$ in (5.6) then
$\alpha(B) + \beta(B)$ contains a unit root, and we say that the model is integrated
GARCH, or IGARCH($p$, $q$) (see Engle and Bollerslev, 1986). It is often the case
that $\alpha(1) + \beta(1)$ is very close to unity for financial time series, and, if this
condition holds, a shock to the conditional variance is persistent in the sense
that it remains important for all future forecasts. As Bollerslev, Engle and
Nelson (1994) argue, however, the concept of persistence in GARCH models is
ambiguous. One reasonable definition is to say that shocks fail to persist when
$\sigma_t^2$ is stationary, so that the conditional expectation $E(\sigma_{t+s}^2 \mid \varepsilon_t, \varepsilon_{t-1}, \ldots)$
converges, as $s \to \infty$, to the unconditional variance $\alpha_0/(1 - \alpha(1) - \beta(1))$. An
alternative definition concentrates on forecast moments and says that shocks
fail to persist if and only if $E(\sigma_{t+s}^{2\eta} \mid \varepsilon_t, \varepsilon_{t-1}, \ldots)$, for some $\eta > 0$, converges to a
finite limit independent of $\varepsilon_t, \varepsilon_{t-1}, \ldots$.

Unfortunately, whether or not shocks persist can depend on which
definition is adopted. For example, consider the GARCH(1,1) model

$$\sigma_{t+1}^2 = \alpha_0 + \alpha_1\varepsilon_t^2 + \beta_1\sigma_t^2 = \alpha_0 + \sigma_t^2(\alpha_1U_t^2 + \beta_1)$$

from which we have

$$E(\sigma_{t+s}^2 \mid \varepsilon_t, \varepsilon_{t-1}, \ldots) = \alpha_0\left(\sum_{k=0}^{s-1}(\alpha_1 + \beta_1)^k\right) + \sigma_t^2(\alpha_1 + \beta_1)^s$$

It is easy to see that the conditional expectation converges to the unconditional
variance $\alpha_0/(1 - \alpha_1 - \beta_1)$ if and only if $\alpha_1 + \beta_1 < 1$, whereas in the
IGARCH model with $\alpha_1 + \beta_1 = 1$ the conditional expectation will tend to
infinity as $s$ increases, i.e.

$$E(\sigma_{t+s}^2 \mid \varepsilon_t, \varepsilon_{t-1}, \ldots) = s\alpha_0 + \sigma_t^2$$

Yet IGARCH models are strictly stationary and, in this case, $E(\sigma_{t+s}^{2\eta} \mid \varepsilon_t, \varepsilon_{t-1}, \ldots)$
converges to a finite limit whenever $0 < \eta < 1$ (see Nelson, 1990a). The
implication of this is that any apparent persistence of shocks may be a
consequence of thick-tailed distributions rather than of inherent
non-stationarity.
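
The behaviour of the conditional variance forecasts in the stationary and integrated cases can be illustrated numerically. The sketch below simply evaluates the forecast expression for the GARCH(1,1) model; the function name `garch11_variance_forecast` and the parameter values are hypothetical.

```python
def garch11_variance_forecast(alpha0, alpha1, beta1, sigma2_t, s):
    """alpha0 * sum_{k<s} (alpha1 + beta1)^k + sigma2_t * (alpha1 + beta1)^s."""
    phi = alpha1 + beta1
    geometric_sum = sum(phi ** k for k in range(s))
    return alpha0 * geometric_sum + sigma2_t * phi ** s


# Stationary case: forecasts approach alpha0 / (1 - alpha1 - beta1).
stationary = [garch11_variance_forecast(0.1, 0.1, 0.8, 2.0, s) for s in (1, 10, 500)]

# Integrated case: forecasts grow linearly, as s * alpha0 + sigma2_t.
integrated = [garch11_variance_forecast(0.1, 0.2, 0.8, 2.0, s) for s in (1, 10, 100)]

print(stationary)
print(integrated)
```

In the stationary case the influence of the current variance dies out geometrically, whereas in the integrated case it is carried forward in full into every forecast.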

Persistence may also be characterised by the impulse response coefficients.
The GARCH(1,1) process can be written, with $\phi_1 = \alpha_1 + \beta_1$, as

$$(1 - \phi_1B)\varepsilon_t^2 = \alpha_0 + (1 - \beta_1B)\nu_t$$

or as

$$\Delta\varepsilon_t^2 = (1 - B)(1 - \phi_1B)^{-1}(1 - \beta_1B)\nu_t = \theta(B)\nu_t$$

The impulse response coefficients are found from the coefficients in the $\theta(B)$
lag polynomial

$$\theta_0 = 1, \quad \theta_1 = \phi_1 - \beta_1 - 1, \quad \theta_j = (\phi_1 - \beta_1)(\phi_1 - 1)\phi_1^{j-2}, \quad j \ge 2$$

The cumulative impulse response $\theta(1)$ is zero because $\theta(B)$ contains a unit
root or, equivalently, because $\sum_{i=0}^{j}\theta_i = (\phi_1 - \beta_1)\phi_1^{j-1}$, which exponentially
tends to zero in the limit as long as $\phi_1 = \alpha_1 + \beta_1 < 1$. When $\phi_1 = \alpha_1 + \beta_1 = 1$,
however, so that we have an IGARCH(1,1) process

$$\Delta\varepsilon_t^2 = \alpha_0 + (1 - \beta_1B)\nu_t$$

$\sum_j\theta_j = 1 - \beta_1 = \theta(1) \ne 0$, and hence shocks persist indefinitely.


5.5.2 Modifications of GARCH processes 


Although we have assumed that the distribution of $\varepsilon_t$ was conditionally
normal, this is not essential. Bollerslev (1987), for example, considers the
case where the distribution is standardised-$t$ with unknown degrees of
freedom $\nu$ that may be estimated from the data: for $\nu > 2$ such a distribution
is leptokurtic and hence has thicker tails than the normal. Other distributions
that have been considered include the normal-Poisson mixture distribution
(Jorion, 1988), the power exponential distribution (Baillie and
Bollerslev, 1989), the normal-log-normal mixture (Hsieh, 1989a) and the
generalised exponential distribution (Nelson, 1991). Estimation procedures
have also been developed that either estimate semi-parametrically the density
of $\varepsilon_t$ (Engle and Gonzalez-Rivera, 1991) or adaptively estimate the parameters
of ARCH models in the presence of non-normal $\varepsilon_t$ (Linton, 1993).

Further modifications result from allowing the relationship between $\sigma_t^2$
and $\varepsilon_t$ to be more flexible than the quadratic mapping that has so far been
assumed. These modifications often lead to general classes of GARCH
models that have been used to study asymptotic properties, the existence of
moments and other time series characteristics. To simplify the exposition,
we shall concentrate on variants of the GARCH(1,1) process

$$\sigma_t^2 = \alpha_0 + \alpha_1\varepsilon_{t-1}^2 + \beta_1\sigma_{t-1}^2 = \alpha_0 + \alpha_1\sigma_{t-1}^2U_{t-1}^2 + \beta_1\sigma_{t-1}^2 \tag{5.7}$$

An early alternative was to model conditional standard deviations rather
than variances (Taylor, 1986, and Schwert, 1989):

$$\sigma_t = \alpha_0 + \alpha_1|\varepsilon_{t-1}| + \beta_1\sigma_{t-1} = \alpha_0 + \alpha_1\sigma_{t-1}|U_{t-1}| + \beta_1\sigma_{t-1} \tag{5.8}$$


This makes the conditional variance the square of a weighted average of 
absolute shocks rather than the weighted average of squared shocks. Con- 
sequently, large shocks have a smaller effect on the conditional variance than 
in the standard GARCH model. 

Rather than concentrating on the variance or standard deviation, Ding,
Granger and Engle (1993) proposed a more flexible and general class of
power ARCH (PARCH) models by estimating an additional parameter:

$$\sigma_t^\gamma = \alpha_0 + \alpha_1|\varepsilon_{t-1}|^\gamma + \beta_1\sigma_{t-1}^\gamma$$


A non-symmetric response to shocks is made explicit in Nelson’s (1991)
exponential GARCH (EGARCH) model

$$\log(\sigma_t^2) = \alpha_0 + \alpha_1f(\varepsilon_{t-1}/\sigma_{t-1}) + \beta_1\log(\sigma_{t-1}^2) \tag{5.9}$$

where

$$f(\varepsilon_{t-1}/\sigma_{t-1}) = \theta_1\varepsilon_{t-1}/\sigma_{t-1} + \left(|\varepsilon_{t-1}/\sigma_{t-1}| - E|\varepsilon_{t-1}/\sigma_{t-1}|\right)$$

The ‘news impact curve’, $f(\cdot)$, relates revisions in conditional volatility, here
given by $\log(\sigma_t^2)$, to ‘news’, $\varepsilon_{t-1}$. It embodies a non-symmetric response since
$\partial f/\partial\varepsilon_{t-1} = \theta_1 + 1$ when $\varepsilon_{t-1} > 0$ and $\partial f/\partial\varepsilon_{t-1} = \theta_1 - 1$ when $\varepsilon_{t-1} < 0$.
(Note that volatility will be at a minimum when there is no news: $\varepsilon_{t-1} = 0$.)

This asymmetry is potentially useful, as it allows volatility to respond more
rapidly to falls in a market than to corresponding rises, which is an important
stylised fact for many financial assets and is known as the leverage effect. This
model also has the advantage that no parameter restrictions are necessary in
order to ensure that the variance is positive. It is easy to show that $f(\varepsilon_{t-1})$ is
strict white noise with zero mean and constant variance, so that $\log(\sigma_t^2)$ is an
ARMA(1,1) process and will be stationary if $\beta_1 < 1$.
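
The asymmetry built into the news impact curve is easy to see numerically. The sketch below assumes standard normal standardised shocks, so that $E|z| = \sqrt{2/\pi}$, and uses a hypothetical negative value of $\theta_1$; the function names are introduced for illustration.

```python
import math

E_ABS_Z = math.sqrt(2.0 / math.pi)   # E|z| for a standard normal z


def news_impact(z, theta1):
    """f(z) = theta1*z + (|z| - E|z|) for a standardised shock z."""
    return theta1 * z + (abs(z) - E_ABS_Z)


def egarch_log_variance(alpha0, alpha1, beta1, theta1, z_prev, log_var_prev):
    """One step of log(sigma_t^2) = alpha0 + alpha1*f(z_{t-1}) + beta1*log(sigma_{t-1}^2)."""
    return alpha0 + alpha1 * news_impact(z_prev, theta1) + beta1 * log_var_prev


theta1 = -0.3                           # hypothetical: bad news raises volatility more
bad = news_impact(-1.0, theta1)         # response to a negative unit shock
good = news_impact(1.0, theta1)         # response to a positive unit shock
print(bad, good, news_impact(0.0, theta1))
```

With $-1 < \theta_1 < 0$ the curve is V-shaped with its minimum at zero, and the left-hand branch is steeper than the right-hand one, which is the leverage pattern described above.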

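A model that nests (5.7), (5.8) and (5.9) is the non-linear ARCH (NARCH)
model (Higgins and Bera, 1992), a general form of which is

$$\sigma_t^\gamma = \alpha_0 + \alpha_1f^\gamma(\varepsilon_{t-1}) + \beta_1\sigma_{t-1}^\gamma$$

while an alternative is the process

$$\sigma_t^\gamma = \alpha_0 + \alpha_1g^{(\gamma)}(\varepsilon_{t-1}) + \beta_1\sigma_{t-1}^\gamma$$

where

$$g^{(\gamma)}(\varepsilon_{t-1}) = \theta I(\varepsilon_{t-1} > 0)|\varepsilon_{t-1}|^\gamma + I(\varepsilon_{t-1} \le 0)|\varepsilon_{t-1}|^\gamma$$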


$I(\cdot)$ being the indicator function. If $\gamma = 1$, we have the threshold ARCH
(TARCH) model of Zakoian (1994), while for $\gamma = 2$ we have the GJR model
of Glosten, Jagannathan and Runkle (1993), which allows a quadratic
response of volatility to news but with different coefficients for good and bad
news, although it maintains the assertion that the minimum volatility will
result when there is no news. More general but less popular versions of
threshold models that allow for richer behaviour include Rabemananjara and
Zakoian (1993), Li and Li (1996) and Audrino and Bühlmann (2001).

Hentschel (1995) defines a very general class of model that nests all the
above ARCH models. The model can be written using the Box and Cox
(1964) transformation as

$$\frac{\sigma_t^\lambda - 1}{\lambda} = \alpha_0 + \alpha_1\sigma_{t-1}^\lambda f^\gamma(U_{t-1}) + \beta_1\frac{\sigma_{t-1}^\lambda - 1}{\lambda} \tag{5.10}$$

where

$$f(U_t) = |U_t - b| - c(U_t - b)$$

Several variants cannot be nested within (5.10). Engle’s (1990) asymmetric
ARCH (AARCH) and Sentana’s (1995) quadratic ARCH (QARCH) are two
such models. These can be written in the simple case being considered here as

$$\sigma_t^2 = \alpha_0 + \alpha_1\varepsilon_{t-1}^2 + \delta\varepsilon_{t-1} + \beta_1\sigma_{t-1}^2$$

where a negative value of $\delta$ means that good news increases volatility less
than bad news. It is the presence of a quadratic form in $\varepsilon_{t-1}$ that precludes
them from being included as special cases of (5.10).

An alternative way of formalising the GARCH(1,1) model (5.7) is to
define $\alpha_0 = \omega(1 - \alpha_1 - \beta_1)$, where $\omega$ is the unconditional variance, or
long-run volatility, to which the process reverts:

$$\sigma_t^2 = \omega + \alpha_1(\varepsilon_{t-1}^2 - \omega) + \beta_1(\sigma_{t-1}^2 - \omega)$$

Engle and Lee (1999) extend this formalisation to allow reversion to a
varying level, defined by $q_t$:

$$\sigma_t^2 = q_t + \alpha_1(\varepsilon_{t-1}^2 - q_{t-1}) + \beta_1(\sigma_{t-1}^2 - q_{t-1})$$

$$q_t = \omega + \xi(q_{t-1} - \omega) + \zeta(\varepsilon_{t-1}^2 - \sigma_{t-1}^2)$$

Here $q_t$ is long-run volatility, which converges to $\omega$ through powers of $\xi$,
while $\sigma_t^2 - q_t$ is the transitory component, converging to zero via powers of
$\alpha_1 + \beta_1$. This component GARCH model can also be combined with the
TARCH model to allow asymmetries in both the permanent and transitory
parts: this asymmetric component GARCH model automatically introduces
asymmetry into the transitory equation.

The stochastic variance and GARCH classes of models have some obvious 
similarities, and a comparison between them is provided in Taylor (1994). 
Finally, as in Robinson (1991), we can form the ARCH($\infty$) class: 

$$\sigma_t^2 = b_0 + \sum_{j=1}^{\infty} b_j \varepsilon_{t-j}^2$$

This can be shown to include the finite-order ARCH and GARCH models as 
special cases. For example, the GARCH(p, q) can be expressed as 

$$\sigma_t^2 = (1 - \beta(1))^{-1}\alpha_0 + (1 - \beta(B))^{-1}\alpha(B)\varepsilon_t^2$$

This allows an ARCH($\infty$) representation with $b_0 = (1 - \beta(1))^{-1}\alpha_0$ and with 
positive weights that decay exponentially according to 
$\alpha(z)/(1 - \beta(z)) = \sum_{i=1}^{\infty} b_i z^i$. In this manner the conditional variance can be 
represented as a moving average of past squared errors with exponentially 
decaying coefficients and an absolutely summable exponentially decaying 
autocovariance function. For a detailed discussion of ARCH($\infty$) models, see 
Giraitis, Leipus and Surgailis (2006). 
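
The weights of the ARCH($\infty$) representation can be generated by expanding $\alpha(z)/(1 - \beta(z))$ as a power series. The sketch below, in Python with NumPy, is illustrative only: the function name is an assumption, and the coefficient lists are taken to hold the lag-1, lag-2, ... coefficients of the ARCH and GARCH polynomials. Matching powers of $z$ in $(1 - \beta(z))\sum_j b_j z^j = \alpha(z)$ gives the recursion used in the loop.

```python
import numpy as np


def arch_inf_weights(alpha, beta, n):
    """First n weights b_1..b_n of alpha(z) / (1 - beta(z)).

    alpha holds the coefficients on lagged squared errors and beta those on
    lagged conditional variances, both starting at lag one.
    """
    b = np.zeros(n + 1)  # b[0] is unused: the series has no z^0 term
    for j in range(1, n + 1):
        a_j = alpha[j - 1] if j <= len(alpha) else 0.0
        # b_j = alpha_j + sum_i beta_i * b_{j-i}, using only weights b_1..b_{j-1}
        b[j] = a_j + sum(
            beta[i - 1] * b[j - i] for i in range(1, min(j - 1, len(beta)) + 1)
        )
    return b[1:]


# GARCH(1,1) with hypothetical coefficients 0.1 and 0.8.
weights = arch_inf_weights([0.1], [0.8], 5)
```

For the GARCH(1,1) case the weights reduce to $\alpha_1\beta_1^{j-1}$, which makes the exponential decay explicit.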


5.5.3 Non-linear GARCH processes 


Attempts have been made in the literature to develop ‘non-linear’ versions of 
GARCH models that allow for even more flexibility in the functional rela- 
tionship between variance and lagged errors. Although non-linear specifi- 
cations for the conditional mean process will be discussed in more detail in 
chapter 6, a brief introduction to the most popular of these non-linear 
conditional variance models will be made here. 
The smooth transition GARCH (STRGARCH) model can be defined as 

$$\sigma_t^2 = \alpha_{10} + \alpha_{11}\varepsilon_{t-1}^2 + (\alpha_{20} + \alpha_{21}\varepsilon_{t-1}^2) f(\varepsilon_{t-1}; \theta, c) + \beta_1\sigma_{t-1}^2$$

where $f(\varepsilon_{t-1}; \theta, c)$ is a continuous bounded transition function. If a logistic 
transition function is used with c = 0, then the model is equivalent to the 
GJR-GARCH. The STRGARCH can be useful in situations where we do not 
want to limit the analysis by allowing only two distinct regimes for the 
conditional variance (for a discussion of these models, see Gonzalez-Rivera, 
1998). Lanne and Saikkonen (2005) proposed a smooth transition GARCH 
process where the lagged conditional variance acts conveniently as the 
transition variable, thus enabling the persistence in the conditional variance 
to depend on its level: 

$$\sigma_t^2 = \alpha_0 + \alpha_1\varepsilon_{t-1}^2 + \delta f(\sigma_{t-1}^2; \theta) + \beta_1\sigma_{t-1}^2$$

The transition function is modelled using the cumulative distribution 
function of the gamma distribution. The original motivation for using this 
model was to overcome the tendency of GARCH models to exaggerate the 
persistence in the conditional variance process, i.e. to estimate $\sum(\alpha_i + \beta_i)$ 
to be very close to unity. 
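
Returning to the STRGARCH form, the role of the transition function can be sketched in Python. Everything specific here is an assumption made for illustration: the text does not give the functional form of the logistic transition, so a decreasing logistic in the lagged shock with slope `theta` and location `c` is assumed, and the parameter values are hypothetical. Under that assumed form the transition approaches an indicator for negative shocks as the slope grows, which is the sense in which a two-regime threshold model is a limiting case.

```python
import math


def logistic_transition(eps_lag, theta, c=0.0):
    """Assumed logistic transition 1 / (1 + exp(theta * (eps_lag - c))), in [0, 1]."""
    z = theta * (eps_lag - c)
    # Numerically stable evaluation that avoids overflow in exp().
    if z >= 0:
        ez = math.exp(-z)
        return ez / (1.0 + ez)
    return 1.0 / (1.0 + math.exp(z))


def strgarch_variance(eps_lag, sigma2_lag, a10, a11, a20, a21, beta1, theta, c=0.0):
    """One-step STRGARCH variance with the assumed logistic transition."""
    f = logistic_transition(eps_lag, theta, c)
    return a10 + a11 * eps_lag**2 + (a20 + a21 * eps_lag**2) * f + beta1 * sigma2_lag


# With a very steep slope and a20 = 0, a negative shock receives the extra
# ARCH coefficient a21 and a positive shock does not (hypothetical values).
neg = strgarch_variance(-2.0, 1.0, 0.1, 0.05, 0.0, 0.1, 0.8, theta=1000.0)
pos = strgarch_variance(+2.0, 1.0, 0.1, 0.05, 0.0, 0.1, 0.8, theta=1000.0)
```

For moderate values of `theta` the variance response moves smoothly between the two regimes instead of switching abruptly at `c`.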

It has been argued that GARCH parameters may not remain constant when 
dealing with data spanning a long time period. For example, Sensier and van 
Dijk (2004) report extensive evidence of changes in volatility for a wide set of 
US macroeconomic time series, while Andreou and Ghysels (2002) have 
evaluated the performance of various tests for detecting structural breaks in 
the conditional variance dynamics of asset returns. As shown by Mikosch and 
Starica (2004) and Hillebrand (2005), the alleged exaggeration of shock per- 
sistence implied by estimated GARCH models may be due to shifts in the 
unconditional variance. Various modelling approaches have been proposed to 
deal with such shifts. For example, Teräsvirta (2007) discusses a time-varying 
GARCH process where parameters may vary according to a smooth transition 
function of a normalised time index. Two additional classes of models have 
also been suggested, the structural ARCH (STARCH) model proposed by 
Harvey, Ruiz and Sentana (1992) and the switching ARCH (SWARCH) model 
proposed by both Cai (1994) and Hamilton and Susmel (1994). Both require 
estimation by the Kalman filter: the former decomposes $\varepsilon_t$ into various 
unobserved components, each of which has an ARCH form; the latter 
postulates several different ARCH models between which the process switches 
via a Markov chain (see chapter 6 for models of this type). 
Semi- and non-parametric methods have also been used in order to relax 
the assumptions concerning the distribution of residuals and the functional 
form in GARCH models. A comprehensive treatment of such approaches 
can be found in Linton (2007). 


5.5.4 Long-memory volatility processes: the FIGARCH model 


An apparent stylised fact of return series is that the absolute values or powers, 
particularly squares, of returns tend to have very slowly decaying auto- 
correlations. For example, Ding, Granger and Engle (1993) find that the first 
negative autocorrelation of the squared returns of the daily S&P 500 index 
over the period 1928 to 1991, analysed in example 4.3, occurred at lag 2598, 
and a similar finding has been provided by Mills (1996a) for the daily returns 
of the London FT30 for the period 1935 to 1994. Additional evidence of this 
feature for financial series is provided by, for example, Taylor (1986) and 
Dacorogna et al. (1993). 

In the spirit of Granger (1980), it has been suggested that long memory in 
the volatility of stock indices may be due to the aggregation of covariance 
stationary processes that individually exhibit short-memory conditional het- 
eroskedasticity. Although it has been found that this is not possible through a 
summation of the GARCH process, Zaffaroni (2007) discusses conditions and 
models that can lead to long memory under aggregation. 

It is also interesting to note the effect of scaling, since it has very important 
applications in risk management and option pricing. As has been discussed 
earlier, a common practice is to derive an estimate of the annualised standard 
deviation, or volatility, by simply multiplying the daily standard deviation by 
the square root of the number of trading days in a year. This procedure 
assumes that returns are iid, however — something that is clearly violated for 
most financial time series. Assuming that the data follow a GARCH(1,1) 
process, Diebold et al. (1998) use the results of Drost and Nijman (1993) on 
the temporal aggregation of GARCH processes to show that simple scaling 
is inappropriate and potentially very misleading. 
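
A rough feel for the size of the scaling error can be obtained from a short Python sketch. It is a simplification and not the temporal aggregation result of Drost and Nijman (1993): it assumes serially uncorrelated returns, so that the variance of the h-period return is the sum of the expected one-period conditional variances under a GARCH(1,1) process. The function names and parameter values are hypothetical.

```python
import math


def garch11_h_period_variance(sigma2_next, alpha0, alpha1, beta1, h):
    """Sum of the expected conditional variances over the next h periods."""
    phi = alpha1 + beta1
    sigma2_bar = alpha0 / (1.0 - phi)  # unconditional variance (requires phi < 1)
    return h * sigma2_bar + (sigma2_next - sigma2_bar) * (1.0 - phi**h) / (1.0 - phi)


def scaling_error(sigma2_next, alpha0, alpha1, beta1, h):
    """Ratio of sqrt(h)-scaled volatility to the GARCH-implied h-period volatility."""
    scaled = math.sqrt(h * sigma2_next)
    implied = math.sqrt(garch11_h_period_variance(sigma2_next, alpha0, alpha1, beta1, h))
    return scaled / implied


# Hypothetical daily parameters; the unconditional variance is about 2e-5.
high_vol = scaling_error(4e-5, 1e-6, 0.05, 0.90, h=10)  # current variance above average
low_vol = scaling_error(1e-5, 1e-6, 0.05, 0.90, h=10)   # current variance below average
```

In this sketch simple scaling overstates multi-period volatility when current volatility is above its long-run level and understates it when current volatility is below, because it ignores mean reversion in the conditional variance.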

In response to these findings of long memory, Baillie, Bollerslev and 
Mikkelsen (1996) consider the fractionally integrated GARCH (FIGARCH) 
process (a closely related process, the long-memory GARCH (LMGARCH) 
model, has been analysed by Karanasos, Psaradakis and Sola, 2004). The 
FIGARCH(1,d,1) process is most transparently defined as an extension of (5.6): 

$$\Delta^d \varepsilon_t^2 = \alpha_0 + (\alpha_1 + \beta_1)\Delta^d \varepsilon_{t-1}^2 + \nu_t - \beta_1 \nu_{t-1} \tag{5.11}$$

Equivalently, but perhaps less transparently, it can be written as 

$$\sigma_t^2 = \alpha_0 + (1 - \Delta^d)\varepsilon_t^2 - (\beta_1 - (\alpha_1 + \beta_1)\Delta^d)\varepsilon_{t-1}^2 + \beta_1 \sigma_{t-1}^2 \tag{5.12}$$

(5.12) can be expressed as 

$$\Delta \varepsilon_t^2 = \alpha_0^* + \Delta^{1-d}(1 - (\alpha_1 + \beta_1)B)^{-1}(1 - \beta_1 B)\nu_t = \alpha_0^* + \theta(B)\nu_t$$

and (5.11) as 

$$\sigma_t^2 = \alpha_0/(1 - \beta_1) + (1 - (1 - (\alpha_1 + \beta_1)B)(1 - \beta_1 B)^{-1}\Delta^d)\varepsilon_t^2 = \alpha_0^{**} + \lambda(B)\varepsilon_t^2$$

Baillie, Bollerslev and Mikkelsen (1996) show that the FIGARCH(p,d,q) 
class of processes is strictly but not weakly stationary for $0 < d < 1$. FIGARCH 
processes with $0 < d < 1$ have $\theta(1) = 0$, so that shocks to the conditional 
variance ultimately die out. Unlike the $d = 0$ case, however, $\theta_j$ decays 
eventually at a hyperbolic, rather than an exponential, rate, so that the 
fractional differencing parameter provides important information about the 
pattern and speed with which shocks to volatility are propagated. For $d > 1$, 
$\theta(1)$ is undefined and the conditional variance is explosive. The conditions 
that ensure a positive conditional variance for the FIGARCH(1,d,1) process 
are $\alpha_0 > 0$, $\alpha_1 + d > 0$ and $1 - 2(\alpha_1 + \beta_1) > d > 0$. 
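
The hyperbolic decay can be seen by computing the weights on lagged squared errors implied by the FIGARCH(1,d,1) process, that is, the coefficients of $1 - (1 - (\alpha_1 + \beta_1)B)(1 - \beta_1 B)^{-1}\Delta^d$. The Python sketch below is illustrative: the function name, the truncation length and the parameter values are assumptions. It expands $\Delta^d = (1 - B)^d$ with the usual binomial recursion, multiplies by $1 - (\alpha_1 + \beta_1)B$ and divides by $1 - \beta_1 B$ term by term.

```python
import numpy as np


def figarch_lambda_weights(d, alpha1, beta1, n):
    """Weights on eps^2_{t-1}, ..., eps^2_{t-n} implied by a FIGARCH(1,d,1)."""
    phi = alpha1 + beta1
    # Coefficients of (1 - B)^d: pi_0 = 1, pi_k = pi_{k-1} * (k - 1 - d) / k.
    pi = np.empty(n + 1)
    pi[0] = 1.0
    for k in range(1, n + 1):
        pi[k] = pi[k - 1] * (k - 1 - d) / k
    # g(B) = (1 - phi B)(1 - B)^d / (1 - beta1 B), built recursively.
    g = np.empty(n + 1)
    g[0] = 1.0
    for k in range(1, n + 1):
        c_k = pi[k] - phi * pi[k - 1]
        g[k] = c_k + beta1 * g[k - 1]
    # The weight polynomial is 1 - g(B); its constant term is zero.
    return -g[1:]


# d = 0 recovers the exponentially decaying GARCH(1,1) weights.
garch_weights = figarch_lambda_weights(0.0, 0.1, 0.8, 6)
# Hypothetical long-memory case with alpha1 + beta1 = 0.2 and beta1 = 0.5.
figarch_weights = figarch_lambda_weights(0.4, -0.3, 0.5, 50)
```

The first weight equals $\alpha_1 + d$, and the hypothetical long-memory parameters satisfy the positivity conditions quoted above, so every weight in `figarch_weights` is positive.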

Baillie, Bollerslev and Mikkelsen (1996) argue that the presence of FIGARCH 
processes may explain the common finding of IGARCH behaviour in high- 
frequency financial data. It is commonly argued (Nelson, 1990b; Nelson and 
Foster, 1994) that GARCH(1,1) models provide consistent discrete-time 
approximations to continuous-time diffusion processes and, as the sampling 
interval goes to zero, the sum of the two GARCH parameters tends to one, 
indicating IGARCH behaviour. IGARCH implies that shocks to the conditional 
variance persist indefinitely, however, and this is difficult to reconcile with the 
persistence observed after large shocks such as the crash of October 1987, and 
also with the perceived behaviour of agents who do not appear to alter the 
composition of their portfolios frequently and radically, as would be implied 
by IGARCH. Temporal aggregation issues also cast doubt on the reasonableness 
of IGARCH models. Drost and Nijman (1993) show that an IGARCH- 
generating process at high frequencies should carry over to low frequencies of 
observation, but this seems at odds with most reported empirical findings. 

Given these anomalies, Baillie, Bollerslev and Mikkelsen suggest that 
the widespread observation of IGARCH behaviour may be an artefact of a 
long-memory FIGARCH data-generating process, and they report a 
simulation experiment that provides considerable support for this line of 
argument. It would thus seem that FIGARCH models should be seriously 
considered when modelling volatility. 


5.5.5 Estimation of ARMA models with ARCH errors 


The analysis has so far proceeded on the assumption that $\varepsilon_t = x_t - \mu$ is 
serially uncorrelated. A natural extension is to allow $x_t$ to follow an 
ARMA(P,Q) process, so that the combined ARMA-ARCH model becomes 

$$\Phi(B)(x_t - \mu) = \Theta(B)\varepsilon_t \tag{5.13}$$

$$\sigma_t^2 = E(\varepsilon_t^2 \mid \varepsilon_{t-1}, \varepsilon_{t-2}, \ldots) = \alpha_0 + \sum_{i=1}^{p} \alpha_i \varepsilon_{t-i}^2 + \sum_{i=1}^{q} \beta_i \sigma_{t-i}^2 \tag{5.14}$$

This latter equation can be written as 

$$\sigma_t^2 = z_t^{\top}\omega = z_{1t}^{\top}\omega_1 + z_{2t}^{\top}\omega_2$$

where 

$$z_t^{\top} = (z_{1t}^{\top} : z_{2t}^{\top}) = (1, \varepsilon_{t-1}^2, \ldots, \varepsilon_{t-p}^2 : \sigma_{t-1}^2, \ldots, \sigma_{t-q}^2)$$

and 

$$\omega^{\top} = (\omega_1^{\top} : \omega_2^{\top}) = (\alpha_0, \alpha_1, \ldots, \alpha_p : \beta_1, \ldots, \beta_q)$$

Using this notation, ML estimates of the model can be obtained in the following 
way. Define $\Omega$ as the vector of parameters in the model given by equations 
(5.13) and (5.14) and partition it as $\Omega = (\omega^{\top} : \psi^{\top})$, 
$\psi^{\top} = (\Phi_1, \ldots, \Phi_P, \Theta_1, \ldots, \Theta_Q, \mu)$ being a vector containing the parameters in the ARMA 
equation. We may also define $\Omega_0 = (\omega_0^{\top} : \psi_0^{\top})$ as the true parameter vector. 

The log-likelihood function for a sample of T observations is, apart from 
some constants, 

$$L_T(\Omega) = T^{-1}\sum_{t=1}^{T} l_t(\Omega)$$

where 

$$l_t(\Omega) = \log\{f(\varepsilon_t/\sigma_t : \varsigma)\} - 0.5\log\sigma_t^2$$

is the log-likelihood for the tth observation and $f(\varepsilon_t/\sigma_t : \varsigma)$ denotes the 
conditional density function for the standardised innovations $\varepsilon_t/\sigma_t$, which has 
mean zero, variance one and nuisance parameters $\varsigma$. Precise details of ML 
estimation may be found in, for example, Engle (1982), Weiss (1986a, 1986b) 
and Bollerslev (1988). The BHHH algorithm of Berndt, Hall, Hall and 
Hausman (1974) is a convenient method of computation. If $\hat{\Omega}^{(i)}$ denotes the 
parameter estimates after the ith iteration, then $\hat{\Omega}^{(i+1)}$ is calculated by the 
algorithm as 

$$\hat{\Omega}^{(i+1)} = \hat{\Omega}^{(i)} + \lambda_i \left(\sum_{t=1}^{T} \frac{\partial l_t}{\partial \Omega}\frac{\partial l_t}{\partial \Omega^{\top}}\right)^{-1} \sum_{t=1}^{T} \frac{\partial l_t}{\partial \Omega}$$

where $\partial l_t/\partial \Omega$ is evaluated at $\hat{\Omega}^{(i)}$ and $\lambda_i$ is a variable step length chosen to 
maximise the likelihood function in the given direction. Because the 
information matrix, $I = -E(\partial^2 l_t/\partial\Omega\,\partial\Omega^{\top})$, is block diagonal, $\omega$ can be estimated 
without loss of asymptotic efficiency based on a consistent estimate of $\psi$, and 
vice versa, so that the iterations for $\omega^{(i)}$ and $\psi^{(i)}$ can be carried out separately. 

The ML estimate $\hat{\Omega}$ is strongly consistent for $\Omega_0$ and asymptotically 
normal with mean $\Omega_0$ and covariance matrix $I^{-1}$, consistently estimated by 
$T\left(\sum_{t=1}^{T} (\partial l_t/\partial\Omega)(\partial l_t/\partial\Omega^{\top})\right)^{-1}$, which may be obtained from the last 
BHHH iteration. 
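
The BHHH update can be sketched for the simplest case. The Python code below is a minimal illustration under several stated assumptions, not a reproduction of any package's routine: it treats a pure GARCH(1,1) with zero mean and conditionally normal errors, starts the variance recursion at the sample variance, approximates the scores by central differences, and replaces the optimal step length by simple step-halving. All function names are hypothetical.

```python
import numpy as np


def garch11_loglik_terms(params, eps):
    """Per-observation Gaussian log-likelihoods l_t for a GARCH(1,1)."""
    a0, a1, b1 = params
    sigma2 = np.empty(len(eps))
    sigma2[0] = np.var(eps)  # assumption: initialise at the sample variance
    for t in range(1, len(eps)):
        sigma2[t] = a0 + a1 * eps[t - 1] ** 2 + b1 * sigma2[t - 1]
    if np.any(sigma2 <= 0):
        return np.full(len(eps), -np.inf)  # inadmissible parameter values
    return -0.5 * (np.log(2 * np.pi) + np.log(sigma2) + eps**2 / sigma2)


def bhhh_step(params, eps, h=1e-6):
    """One BHHH iteration using the outer product of numerical scores."""
    params = np.asarray(params, dtype=float)
    base = garch11_loglik_terms(params, eps)
    scores = np.empty((len(eps), len(params)))
    for j in range(len(params)):
        step = np.zeros_like(params)
        step[j] = h * max(1.0, abs(params[j]))
        upper = garch11_loglik_terms(params + step, eps)
        lower = garch11_loglik_terms(params - step, eps)
        scores[:, j] = (upper - lower) / (2 * step[j])
    # Direction: (sum of score outer products)^(-1) times the summed score.
    direction = np.linalg.solve(scores.T @ scores, scores.sum(axis=0))
    lam = 1.0
    while lam > 1e-8:  # step-halving: accept the first step that improves L
        candidate = params + lam * direction
        if garch11_loglik_terms(candidate, eps).sum() > base.sum():
            return candidate
        lam /= 2
    return params
```

Repeated calls to `bhhh_step` give a non-decreasing sequence of log-likelihood values by construction, since a candidate is accepted only if it improves on the current value. The matrix `scores.T @ scores` from the final iteration is the quantity whose inverse estimates the covariance matrix of the parameter estimates.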

Of course, the actual implementation of the ML procedure requires an 
explicit assumption about the conditional density $f(\varepsilon_t/\sigma_t : \varsigma)$. The most 
commonly employed distribution is the normal, for which 

$$\log f(\varepsilon_t/\sigma_t : \varsigma) = -\tfrac{1}{2}\log(2\pi) - \tfrac{1}{2}\log\sigma_t^2 - \tfrac{1}{2}(\varepsilon_t^2/\sigma_t^2)$$

From the discussion in section 5.5.1, the ARCH model with conditionally 
normal errors results in a leptokurtic unconditional distribution. Nonetheless, 
the degree of leptokurtosis so induced often does not capture all the 
fat tails present in financial data; this is discussed in detail in chapter 7. 
Consequently, various alternatives were discussed in section 5.2. Perhaps the 
two most popular are Bollerslev's (1987) standardised t-distribution, and 
Nelson's (1991) generalised error distribution (GED). 

For the t-distribution, 

$$\log f(\varepsilon_t/\sigma_t : \varsigma) = -\frac{1}{2}\log\left(\frac{\pi(\nu - 2)\Gamma(\nu/2)^2}{\Gamma((\nu + 1)/2)^2}\right) - \frac{1}{2}\log\sigma_t^2 - \frac{\nu + 1}{2}\log\left(1 + \frac{\varepsilon_t^2}{\sigma_t^2(\nu - 2)}\right)$$

Here $\Gamma(\cdot)$ denotes the gamma function, and the degrees of freedom 
parameter $\nu > 2$, which controls the tail behaviour of the distribution, may 
also be estimated from the data. As is well known, the t-distribution is 
symmetric around zero and converges to the normal distribution as $\nu \to \infty$, 
but for $4 < \nu < \infty$ the conditional kurtosis equals $3(\nu - 2)/(\nu - 4)$, which 
exceeds the normal value of three, so the estimate of $\nu$ will provide an 
indication of the fatness of the tails. 

For the GED, 

$$\log f(\varepsilon_t/\sigma_t : \varsigma) = -\frac{1}{2}\log\left(\frac{\Gamma(1/\xi)^3}{\Gamma(3/\xi)(\xi/2)^2}\right) - \frac{1}{2}\log\sigma_t^2 - \left(\frac{\Gamma(3/\xi)\,\varepsilon_t^2}{\sigma_t^2\,\Gamma(1/\xi)}\right)^{\xi/2}$$

The tail parameter is $\xi > 0$. The normal distribution is obtained when $\xi = 2$ 
and for smaller values the distribution is fat-tailed. An alternative approach 
uses mixtures of normals to represent the conditional distribution (e.g. see 
Haas, Mittnik and Paolella, 2004). This preserves the assumption of 
normality for the error process while capturing characteristics of the data such as 
asymmetry and fat tails. For a detailed discussion of the issues involved in 
estimation and inference in GARCH models, see Li, Ling and McAleer 
(2002), Bollerslev, Engle and Nelson (1994) and Straumann (2004). 
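
The relationship between the three densities can be checked with a few lines of Python. The sketch assumes the log-density expressions in the form written above, each including the $-\frac{1}{2}\log\sigma_t^2$ term; the function names are hypothetical and only the standard library is used. It confirms numerically that the GED with tail parameter two coincides with the normal, that the t-distribution approaches the normal as the degrees of freedom grow, and that both alternatives assign more mass to large shocks.

```python
import math


def log_density_normal(eps, sigma2):
    """Gaussian log-density contribution."""
    return -0.5 * math.log(2 * math.pi) - 0.5 * math.log(sigma2) - 0.5 * eps**2 / sigma2


def log_density_t(eps, sigma2, nu):
    """Standardised Student-t log-density contribution (requires nu > 2)."""
    const = math.log(math.pi * (nu - 2)) + 2 * (
        math.lgamma(nu / 2) - math.lgamma((nu + 1) / 2)
    )
    tail = 0.5 * (nu + 1) * math.log1p(eps**2 / (sigma2 * (nu - 2)))
    return -0.5 * const - 0.5 * math.log(sigma2) - tail


def log_density_ged(eps, sigma2, xi):
    """GED log-density contribution (requires xi > 0)."""
    const = 3 * math.lgamma(1 / xi) - math.lgamma(3 / xi) - 2 * math.log(xi / 2)
    scale = math.exp(math.lgamma(3 / xi) - math.lgamma(1 / xi))
    return -0.5 * const - 0.5 * math.log(sigma2) - (scale * eps**2 / sigma2) ** (xi / 2)


# A five-standard-deviation shock is far less surprising under fat tails.
normal_tail = log_density_normal(5.0, 1.0)
t_tail = log_density_t(5.0, 1.0, 5.0)
ged_tail = log_density_ged(5.0, 1.0, 1.21)
```

Working with `math.lgamma` rather than the gamma function itself avoids overflow when the degrees of freedom are large.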

What are the consequences of ignoring possible non-normality and 
continuing to use the normal density for $f(\varepsilon_t/\sigma_t : \varsigma)$? This is known as 
quasi-ML (QML) estimation and produces parameter estimates that are consistent and 
asymptotically normal. Indeed, Jensen and Rahbek (2004) prove that, unlike 
the unit root case, these properties hold for the GARCH(1, 1) model over the 
entire parameter region, including both stationary and explosive behaviour. 
Although standard errors will be inconsistent, they can be corrected using 
robust alternatives (Bollerslev and Wooldridge, 1992). 

For symmetric departures from conditional normality, the QML estimator 
is generally close to the exact ML estimator, but for non-symmetric 
conditional distributions both the asymptotic and the finite sample loss in 
efficiency may be quite large, and semi-parametric estimation may be pre- 
ferred. For a comparison of the loss in asymptotic efficiency of QML and 
semi-parametric estimation compared to ML, see Gonzalez-Rivera and 
Drost (1999). While the asymptotic properties of the GARCH QML esti- 
mators are well understood, the limiting distribution of the ARMA-GARCH 
model has been established under the rather strict assumption of finite 
fourth moments. Recently, Ling (2007) has proposed a self-weighted QML 
approach for estimating ARMA-GARCH and IGARCH models and shows 
that it is consistent and asymptotically normal under only a fractional 
moment condition for errors. 

In practice, different software packages, optimisation algorithms and 
initial value parameterisations may, unfortunately, lead to significant 
variations in estimates (e.g. see Brooks, Burke and Persand, 2001). In an attempt 
to overcome these problems, Kristensen and Linton (2006) have recently 
developed a closed-form estimator for the parameters of the GARCH(1,1) 
model that has the advantage of not requiring numerical optimisation and 
an arbitrary selection of initial values. Assuming that fourth moments exist, 
Kristensen and Linton show that the combination of the estimator with a 
finite-order Newton—Raphson procedure will yield asymptotically the same 
distribution as QML. 

Finally, it should be noted that the application of ARCH estimation 
techniques depends on the variance process being observable and measur- 
able. ARCH models have been widely used in a variety of applications in 
finance and economics as latent processes, however, in which case the log- 
likelihood function cannot be expressed analytically. To circumvent this 
problem, Fiorentini, Sentana and Shephard (2004) develop exact likelihood- 
based estimators of latent variable ARCH-type models using an MCMC 
algorithm. 


5.5.6 Testing for the presence of ARCH errors 


Let us suppose that an ARMA model for $x_t$ has been estimated, from which 
the residuals $e_t$ have been obtained. The presence of ARCH can lead to 
serious model misspecification if it is ignored; as with all forms of 
heteroskedasticity, analysis assuming its absence will result in inappropriate 
parameter standard errors, and these will typically be too small. For example, 
Weiss (1984) shows that ignoring ARCH will lead to the identification of 
ARMA models that are overparameterised, and Milhøj (1985) demonstrates 
that standard tests of serial correlation may over-reject the null. 

Methods for testing whether ARCH is present are therefore essential, 
particularly as estimation incorporating it requires the complicated iterative 
techniques discussed above. Equation (5.6) has shown that if $\varepsilon_t$ is 
GARCH(p,q) then $\varepsilon_t^2$ is ARMA(m,q), where m = max(p,q), and Bollerslev (1986) 
shows that standard ARMA theory follows through in this case. This implies 
that the squared residuals $e_t^2$ can then be used to identify m and q, and 
therefore p, in a fashion similar to the way the usual residuals are used in 
conventional ARMA modelling. McLeod and Li (1983), for example, show 
that the sample autocorrelations of $e_t^2$ have asymptotic variance $T^{-1}$ and that 
portmanteau statistics calculated from them are asymptotically $\chi^2$ if the $e_t^2$ 
are independent. 

Formal tests are also available. Engle (1982) shows that a test of the null 
hypothesis that $\varepsilon_t$ has a constant conditional variance against the alternative 
that the conditional variance is given by an ARCH(q) process, i.e. a test of 
$\alpha_1 = \cdots = \alpha_q = 0$ in (5.14) conditional upon $\beta_1 = \cdots = \beta_p = 0$, may be 
based on the Lagrange multiplier principle. The test procedure is to run a 
regression of $e_t^2$ on $e_{t-1}^2, \ldots, e_{t-q}^2$ and to test the statistic $T \cdot R^2$ as a $\chi_q^2$ variate, 
where $R^2$ is the squared multiple correlation coefficient of the regression. An 
asymptotically equivalent form of the test, which may have better small sample 
properties, is to compute the standard F test from the regression. The intuition 
behind this test is clear. If the data are indeed homoskedastic, then the variance 
cannot be predicted and variations in e? will be purely random. If ARCH 
effects are present, however, such variations will be predicted by lagged values 
of the squared residuals. Of course, if the residuals themselves contain some 
remaining autocorrelation or, perhaps, some other form of non-linearity, then 
it is quite likely that this test for ARCH will reject, since these errors may 
induce autocorrelation in the squared residuals; we cannot simply assume that 
ARCH effects are necessarily present when the ARCH test rejects. 
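
The $T \cdot R^2$ procedure is straightforward to compute. The Python sketch below uses NumPy's least-squares routine; the function name is an assumption, and T is taken to be the number of observations actually used in the auxiliary regression. The resulting statistic would be compared with a $\chi^2$ critical value with q degrees of freedom (3.84 at the 5 per cent level when q = 1).

```python
import numpy as np


def arch_lm_statistic(resid, q):
    """T * R^2 from regressing squared residuals on q of their own lags."""
    e2 = np.asarray(resid, dtype=float) ** 2
    y = e2[q:]
    # Column j holds the squared residuals lagged j periods, aligned with y.
    lags = [e2[q - j : len(e2) - j] for j in range(1, q + 1)]
    X = np.column_stack([np.ones(len(y))] + lags)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ coef
    r2 = 1.0 - np.sum((y - fitted) ** 2) / np.sum((y - y.mean()) ** 2)
    return len(y) * r2
```

Applied to residuals with ARCH effects the statistic will tend to be large, while for independent residuals it should behave like a draw from the $\chi_q^2$ distribution. As the text notes, a rejection may also reflect remaining autocorrelation or other non-linearity in the residuals.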

Strictly, since the parameters of an ARCH model must be positive, a test of 
ARCH should be formulated as a one-sided test, which should presumably 
be more powerful than the above $T \cdot R^2$ test. Engle, Hendry and Trumble 
(1985) thus suggest a one-sided test for ARCH(1) by using the square root of 
the LM test with an appropriate sign, but this approach cannot be extended 
to test higher-order ARCH(q) alternatives. In this situation, either the test 
proposed by Lee and King (1993), and extended by Hong (1997), or that of 
Demos and Sentana (1998) may be employed. These tests are necessarily 
more complicated to derive and compute, and we refer the reader to the 
above references for details. 

When the alternative is a GARCH(p, q) process, some complications arise. 
In fact, a general test of p > 0, q > 0 against a white-noise null is not feasible, 
nor is a test of GARCH($p + r_1$, $q + r_2$) errors, where $r_1 > 0$ and $r_2 > 0$, when 
the null is GARCH(p, q). Furthermore, under this null, the LM tests for 
GARCH(p, r) and ARCH(p + r) alternatives coincide. What can be tested is 
the null of an ARCH(p) process against a GARCH(p, q) alternative, i.e. a test 
of $\omega_2 = 0$ using the notation of the previous section: Bollerslev (1988) 
provides details. 

Lumsdaine and Ng (1999), among others, have studied the behaviour of 
LM tests for ARCH when the conditional mean equation is misspecified due 
to, for example, omitted lags, parameter instability and structural change. 
They show that misspecification will typically result in an over-rejection of 
the null hypothesis of no ARCH and propose a heuristic approach using 
recursive residuals to improve performance. Blake and Kapetanios (2007) 
demonstrate that the effect of misspecification in the conditional mean 
equation, due to neglected non-linearity in particular, will affect severely the 
size of LM tests for ARCH. They also propose new robust testing procedures 
that rely on testing for ARCH via standard approaches after removing 
possible non-linearity from the conditional mean process using a non- 
parametric approach. A general approach to testing for ARCH effects and 
misspecification has been proposed by Lundbergh and Teräsvirta (2002), 
who develop a number of LM-type procedures to test for the presence of 
ARCH and misspecification in GARCH models with respect to asymmetry 
and parameter constancy. Finally, Dufour et al. (2004) have proposed a 
Monte Carlo approach to derive finite-sample GARCH tests under possibly 
non-normal error distributions. 

As with ARMA modelling, information criteria have also been used to 
identify the correct lag structure and type of GARCH model amongst 
competing specifications. As shown by Brooks and Burke (2002), however, 
information criteria may not be able to identify the true model but, rather, 
the best approximating model from those available. Brooks and Burke 
(2003) derive appropriate modifications of standard information criteria for 
selecting models from the AR family with GARCH errors. Hughes, King and 
Teng (2004) suggest using a modification of the AIC to account for the 
one-sided nature of ARCH parameters. 


5.5.7 ARCH and theories of asset pricing 


The importance of ARCH processes in modelling financial time series is seen 
most clearly in models of asset pricing that involve agents maximising 
expected utility over uncertain future events. To illustrate this, consider the 
following example, taken from Engle and Bollerslev (1986). Suppose a 
representative agent must allocate his or her wealth, $W_t$, between the shares of a 
risky asset $q_t$ at a price $p_t$ and those of a risk-free asset $x_t$, whose price is set 
equal to one. The shares of the risky asset will be worth $y_{t+1}$ each at the end 
of the period (if there are no dividends, then $y_{t+1} = p_{t+1}$). The risk-free asset 
will be worth $r_t x_t$, where $r_t$ denotes one plus the risk-free rate of interest. 
If the agent has a mean-variance utility function in end-of-period wealth, 
$W_{t+1} = q_t y_{t+1} + r_t x_t$, then the allocation problem for the agent is to 
maximise this utility function with respect to holdings of the risky asset, $q_t$, 
i.e. to maximise 

$$2E_t(q_t y_{t+1} + r_t x_t) - \gamma_t V_t(q_t y_{t+1})$$

subject to the start-of-period wealth constraint 

$$W_t = x_t + p_t q_t$$

This has the solution 

$$p_t = r_t^{-1} E_t(y_{t+1}) - \gamma_t q_t r_t^{-1} V_t(y_{t+1}) \tag{5.15}$$

If the outstanding stock of the risky asset is fixed at q, and $\gamma_t$ and $r_t$ are taken 
as constants ($\gamma$ and r respectively), then (5.15) describes the asset pricing 
model. 

If the risky asset is interpreted as a forward contract for delivery in s 
periods' time, the price that a pure speculator would be willing to pay is 

$$p_t = r^{-s}(E_t(y_{t+s}) - \delta V_t(y_{t+s})) \tag{5.16}$$

where $r^{-s}$ gives the present discounted value at the risk-free rate r and $\delta = \gamma q$. 
A simple redating of the model shows that the price of the forward contract 
at time t + 1, for s > 2 periods remaining to maturity, can be expressed as 

$$p_{t+1} = r^{1-s}(E_{t+1}(y_{t+s}) - \delta V_{t+1}(y_{t+s}))$$

Taking expectations at time t, multiplying by $r^{-1}$ and subtracting from (5.16) 
gives 

$$p_t = r^{-1}E_t(p_{t+1}) - \delta r^{-s}(V_t(y_{t+s}) - E_t(V_{t+1}(y_{t+s}))) \tag{5.17}$$

Now, suppose $y_t$ can be represented by an infinite moving average process 
where the innovations are uncorrelated but have time-varying conditional 
variance $\sigma_t^2$: 

$$y_t = \varepsilon_t + \sum_{i=1}^{\infty} \theta_i \varepsilon_{t-i} = \theta(B)\varepsilon_t \tag{5.18}$$

$$V_t(y_{t+1}) = V_t(\varepsilon_{t+1}) = \sigma_{t+1}^2$$

Thus, 

$$V_t(y_{t+s}) = E_t\left(\sum_{i=1}^{s} \theta_{s-i}\varepsilon_{t+i}\right)^2 = \sum_{i=1}^{s} \theta_{s-i}^2 E_t(\sigma_{t+i}^2)$$

where $\theta_0 = 1$. Consequently, 

$$V_t(y_{t+s}) - E_t(V_{t+1}(y_{t+s})) = \theta_{s-1}^2 \sigma_{t+1}^2$$

and (5.17) becomes 

$$p_t = r^{-1}E_t(p_{t+1}) - \delta r^{-s}\theta_{s-1}^2 \sigma_{t+1}^2$$

which is the familiar formula for a one-period holding yield with the explicit 
calculation of the effect of the changing variance of $y_{t+s}$ for a risk-averse 
agent. 

In this simple model the only source of uncertainty derives from the future 
spot price to which the contract relates. In many other situations, however, 
there is a flow of uncertain distributions that accrue to the owner of the asset: 
for example, the price of a share is determined by the present discounted 
value of the expected dividend stream. The precise form in which the vari- 
ability of future pay-offs enters the asset pricing formulation will depend, 
amongst other things, on the utility function of the agents and the inter- 
temporal substitutability of the payouts. A simple formulation might be 


$$p_t = \sum_{s=1}^{\infty} r^{-s}(E_t(y_{t+s}) - \delta V_t(y_{t+s}))$$

where $\{y_t\}_{t+1}^{\infty}$ is the future income stream generated by the asset. If $y_t$ again 
follows the process (5.18), this pricing equation can be converted to the 
holding yield expression 

$$p_t = r^{-1}(E_t(p_{t+1}) + E_t(y_{t+1}) - \delta\lambda\sigma_{t+1}^2)$$

where $\lambda$ depends upon $\theta(B)$ and r. 

It is clear therefore that, if $\delta \neq 0$, the conditional variance of $y_t$ in the future 
will affect the price of the asset today. If such variances can be forecast as in a 
GARCH process, then the current information on $y_t$ and the current 
conditional variance will have an effect on the current price. The size of the 
effect, however, will depend upon the persistence of the variance, i.e. on 
how important current information is in predicting future variances. 

A closed-form solution to the simple asset pricing formula (5.16) depends 
upon the process assumed to generate the 'forcing variable' $y_t$. Suppose $y_t$ is a 
random walk with innovations that follow an IGARCH(1,1) process. Then 
$E_t(y_{t+s}) = y_t$ and 

$$V_t(y_{t+s}) = E_t\left(\sum_{i=1}^{s} \sigma_{t+i}^2\right) = s\sigma_{t+1}^2$$

so that (5.16) gives $p_t = r^{-s}(y_t - \delta s\sigma_{t+1}^2)$. 
For a future contract where no money changes hands until the terminal 
date t + s, the risk-free rate of return is zero so that r = 1, i.e. the solution 
simplifies to 

$$p_t = y_t - \delta s\sigma_{t+1}^2$$

If $\delta \neq 0$ there will be a time-varying risk premium in the future contract. For 
contracts far in the future, new information will have a substantial effect on 
asset prices as it changes agents' perceptions of the variance of the final 
pay-off as well as all the intermediate variances. This persistence gives 
time-varying risk premia even for contracts many periods into the future, and thus 
implies sizeable effects on asset prices. 

Alternatively, suppose that the random walk innovations to $y_t$ are serially 
independent with constant variance $\sigma^2$. In this case $V_t(y_{t+s}) = s\sigma^2$ and the 
solution to (5.16) is 

$$p_t = y_t - \delta s\sigma^2$$


so that, although the variance of the spot price enters the pricing equation, it 
does not give rise to a time-varying risk premium since new information 
casts no light on future uncertainty. 

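Finally, consider an intermediate case where the innovations are 
GARCH(1,1) such that $\alpha_1 + \beta_1 < 1$. The unconditional variance will be 
$\sigma^2 = \alpha_0/(1 - \alpha_1 - \beta_1)$, and it is easy to show that 

$$E_t(\sigma_{t+s}^2 - \sigma^2) = (\alpha_1 + \beta_1)^{s-1}(\sigma_{t+1}^2 - \sigma^2)$$

and 

$$V_t(y_{t+s}) = \sum_{i=1}^{s} E_t(\sigma_{t+i}^2) = s\sigma^2 + (\sigma_{t+1}^2 - \sigma^2)\left(\frac{1 - (\alpha_1 + \beta_1)^s}{1 - \alpha_1 - \beta_1}\right)$$

Substituting into (5.16), the solution of the future contract is 

$$p_t = y_t - \delta s\sigma^2 + \delta(\sigma^2 - \sigma_{t+1}^2)\left(\frac{1 - (\alpha_1 + \beta_1)^s}{1 - \alpha_1 - \beta_1}\right)$$

Current information, embodied in the term $\sigma^2 - \sigma_{t+1}^2$, continues to be an 
important part of the time-varying risk premium even for large s, but, in 
contrast to the solution for the IGARCH(1,1) model, where $\alpha_1 + \beta_1 = 1$, its 
importance decreases with the length of the contract. 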
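
The three cases can be compared with a small Python sketch. It is illustrative only: it assumes r = 1 and a random-walk spot price, as in the discussion above, and the function names and numerical values are hypothetical. Rather than using the closed-form expressions, it iterates the forecast recursion $E_t(\sigma_{t+i+1}^2) = \alpha_0 + (\alpha_1 + \beta_1)E_t(\sigma_{t+i}^2)$ and sums the expected variances, which covers the stationary, integrated and constant-variance cases with one routine.

```python
def expected_variance_path(sigma2_next, alpha0, alpha1, beta1, s):
    """E_t(sigma^2_{t+i}) for i = 1..s under a GARCH(1,1) process."""
    path = [sigma2_next]
    for _ in range(s - 1):
        path.append(alpha0 + (alpha1 + beta1) * path[-1])
    return path


def futures_price(y_t, delta, sigma2_next, alpha0, alpha1, beta1, s):
    """p_t = y_t - delta * V_t(y_{t+s}) for a random-walk spot price and r = 1."""
    variance = sum(expected_variance_path(sigma2_next, alpha0, alpha1, beta1, s))
    return y_t - delta * variance


# Hypothetical inputs: spot price 100, delta 0.5, five periods to maturity.
stationary = futures_price(100.0, 0.5, 2.0, 0.1, 0.1, 0.8, 5)  # alpha1 + beta1 = 0.9
integrated = futures_price(100.0, 0.5, 2.0, 0.0, 0.2, 0.8, 5)  # IGARCH, no constant
constant = futures_price(100.0, 0.5, 1.0, 1.0, 0.0, 0.0, 5)    # constant variance of one
```

In the integrated case the whole of the current variance is carried forward to every horizon, so the risk premium is larger than in the stationary case for the same starting variance.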

These examples thus establish that a solution to an asset pricing equation 
depends in a crucial way on the distribution of the forcing variable, $y_t$, in 
particular on its conditional variance, which is naturally modelled as an 
ARCH process. 

We should also note that, in a manner analogous to stochastic variance 
models being discrete approximations to continuous-time option valuation 
models that use diffusion processes, ARCH models can also approximate a 
wide range of stochastic differential equations. This was first shown by 
Nelson (1990b), and further developments are contained in, for example, 
Nelson and Foster (1994), Drost and Nijman (1993) and Drost and Werker 
(1996). Further analysis of the predictive aspects of ARMA-ARCH models is 
developed in Baillie and Bollerslev (1992). The survey by Bollerslev, Chou 
and Kroner (1992) focuses on the application of ARCH models to stock 
return and interest rate data, emphasising the use of ARCH to model 
volatility persistence, and to foreign exchange rate data, where the charac- 
terisation of exchange rate movements has important implications for many 
issues in international finance. 


Example 5.5 GARCH models for the dollar/sterling exchange rate 

Here we fit various GARCH(p,q) models to the first differences of the 
dollar/sterling exchange rate, the level of which was found to be a driftless 
random walk (see examples 2.5 and 4.3). Thus, with $\Delta x_t = \varepsilon_t$, assuming 
homoskedasticity — i.e. GARCH(0,0) — produces an LM test statistic for 
twelfth-order ARCH of 72.2, which shows that there is strong evidence of 
conditional heteroskedasticity, and, as six of the twelve lag coefficients in the 
autoregression of the squared residuals are significant, a GARCH formula- 
tion is suggested. Not surprisingly, the residuals are highly non-normal, 
being fat-tailed and positively skewed. 

GARCH(1,1) estimates under normality, t and GED distributional 
assumptions are reported in table 5.2. In all cases, both GARCH parameters 
are significant and the LM test for any neglected ARCH is insignificant. Note 
that the GARCH parameters sum to just under unity, suggesting strong 
persistence in conditional variance. The distribution parameter estimates are 
what might be expected from a fat-tailed $x_t$ distribution. In terms of 
log-likelihoods, the GED assumption produces the best fit, but little difference 
is found in the GARCH parameter estimates for all three models. The 
conditional standard deviations from the GED model are shown in figure 5.4. 
A comparison with the volatility series from the SV model fitted in 
example 5.3, shown in figure 5.3, reveals a close similarity. 

Table 5.2 GARCH(1,1) estimates for the dollar/sterling exchange rate 

                                     Normal           t                GED
$\hat{\alpha}_0$                     7.11 (1.36)      3.12 (1.50)      4.64 (1.86)
$\hat{\alpha}_1$                     0.032 (0.004)    0.033 (0.006)    0.033 (0.006)
$\hat{\beta}_1$                      0.957 (0.005)    0.963 (0.006)    0.961 (0.007)
$\hat{\varsigma}$                    -                5.45 (0.59)      1.21 (0.040)
$\hat{\alpha}_1 + \hat{\beta}_1$     0.989            0.997            0.993
ARCH(12)                             10.0 [0.61]      8.7 [0.72]       9.4 [0.67]
L                                    11627.0          11710.7          11731.1

Figures in ( ) are standard errors; figures in [ ] are prob-values. ARCH(12) is the LM test for 
twelfth-order ARCH. L is the log-likelihood. Estimation was performed in EVIEWS 5 using 
the BHHH algorithm. Estimates of $\alpha_0$ are scaled by $10^{-7}$. 

Figure 5.4 Conditional standard deviations of the dollar/sterling exchange rate from the 
GARCH(1,1) model with GED errors 

A variety of non-linear GARCH variants were also entertained. Noting 
that the EGARCH model (5.9) can be written as 

$$\log(\sigma_t^2) = \alpha^{*} + \alpha_1\left|\varepsilon_{t-1}/\sigma_{t-1}\right| + \gamma\,\varepsilon_{t-1}/\sigma_{t-1} + \beta_1\log(\sigma_{t-1}^2)$$

where $\alpha^{*} = \alpha_0 - \alpha_1\sqrt{2/\pi}$ and $\gamma = \alpha_1\theta_1$, estimation of this model with GED 
errors obtained 

$$\log(\sigma_t^2) = \underset{(0.034)}{-0.129} + 0.070\left|\varepsilon_{t-1}/\sigma_{t-1}\right| + \underset{(0.008)}{0.018}\,\varepsilon_{t-1}/\sigma_{t-1} + \underset{(0.003)}{0.992}\log(\sigma_{t-1}^2)$$

$\gamma$ is significantly different from zero, so there does appear to be an 
asymmetric news effect. The estimate of the GED parameter is 1.21, but, as 
$L = 11730.7$, this model is just inferior to the GARCH specification. Both 
TARCH and component GARCH models produced no improvement over 
the basic GARCH formulation, and the PARCH(1, 1) model was estimated as 

$$\sigma_t^{1.82} = \underset{(2.12\times10^{-7})}{1.01\times10^{-7}} + \underset{(0.008)}{0.033}\left|\varepsilon_{t-1}\right|^{1.82} + \underset{(0.007)}{0.963}\,\sigma_{t-1}^{1.82}$$

For this model the GED parameter estimate is again 1.21 and $L = 11731.9$. Since $\hat{\lambda} = 1.82$ is accompanied by 
a standard error of 0.41, the null that $\lambda = 2$ cannot be rejected, thus 
supporting the choice of the GARCH(1, 1) specification (a comparison of the L 
values for the two models also confirms this). 
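
The two comparisons made for the PARCH model can be checked with a few lines of arithmetic. The sketch below uses only the figures reported above; the 5 per cent critical values (1.96 and 3.841) are the standard normal and $\chi^2(1)$ values, and the variable names are illustrative.

```python
# Figures reported in the text for the PARCH(1,1) and GARCH(1,1) fits with GED errors.
lam_hat, se_lam = 1.82, 0.41
logl_parch, logl_garch = 11731.9, 11731.1

# t-ratio for the null lambda = 2, which is the GARCH(1,1) restriction.
t_ratio = (lam_hat - 2.0) / se_lam

# Likelihood ratio statistic for the same single restriction.
lr_stat = 2.0 * (logl_parch - logl_garch)

NORMAL_5PCT = 1.96   # two-sided 5% critical value of N(0, 1)
CHI2_1_5PCT = 3.841  # 5% critical value of chi-squared with 1 degree of freedom

print(round(t_ratio, 2), abs(t_ratio) < NORMAL_5PCT)
print(round(lr_stat, 2), lr_stat < CHI2_1_5PCT)
```

Both statistics fall well inside their acceptance regions, so neither the Wald-type ratio nor the likelihood comparison favours the extra power parameter.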


Example 5.6 GARCH models for S&P 500 daily returns 
In this example we again analyse the daily returns (logarithmic first differ- 
ences) of the S&P 500 index from January 1928 to August 1991, first looked 
at in example 3.7. Ding, Granger and Engle (1993) initially fitted an MA(1)-GARCH(1,1) 
model with normal innovations to the returns, $x_t$. We find that 
an MA(1)-GARCH(1,2) model with standardised-t innovations provides a 
somewhat better fit: 

$$x_t = \underset{(0.00006)}{0.00054} + \varepsilon_t + \underset{(0.008)}{0.137}\,\varepsilon_{t-1}$$

$$\sigma_t^2 = \underset{(1.28\times10^{-7})}{8.58\times10^{-7}} + \underset{(0.009)}{0.104}\,\varepsilon_{t-1}^2 + \underset{(0.108)}{0.586}\,\sigma_{t-1}^2 + \underset{(0.101)}{0.306}\,\sigma_{t-2}^2$$

Distribution parameter = 5.86 (0.24),  L = 57425.8,  ARCH(12) = 18.1 [0.11] 


The GARCH parameters sum to 0.992, indicating IGARCH behaviour. 



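The model can, nevertheless, be improved upon. The TARCH extension of 
this model is 

$$x_t = \underset{(0.00006)}{0.00042} + \varepsilon_t + \underset{(0.008)}{[\ldots]}\,\varepsilon_{t-1}$$

$$\sigma_t^2 = \underset{(1.26\times10^{-7})}{9.24\times10^{-7}} + \underset{(0.007)}{0.0047}\,\varepsilon_{t-1}^2 + \underset{(0.012)}{0.107}\,I(\varepsilon_{t-1}<0)\cdot\varepsilon_{t-1}^2 + \underset{(0.105)}{0.624}\,\sigma_{t-1}^2 + \underset{(0.097)}{0.268}\,\sigma_{t-2}^2$$

Distribution parameter = 6.23 (0.26),  L = 57505.9 

while the EGARCH variant is 

$$x_t = \underset{(0.00006)}{[\ldots]} + \varepsilon_t + \underset{(0.008)}{[\ldots]}\,\varepsilon_{t-1}$$

$$\log(\sigma_t^2) = -0.261 + 0.188\left|\varepsilon_{t-1}/\sigma_{t-1}\right| - \underset{(0.007)}{0.088}\,\varepsilon_{t-1}/\sigma_{t-1} + 0.658\log(\sigma_{t-1}^2) + 0.329\log(\sigma_{t-2}^2)$$

Distribution parameter = 6.25 (0.26),  L = 57553.3 

Both models provide a significant leverage effect, with ‘bad’ news, $\varepsilon_{t-1} < 0$, 
increasing volatility more than ‘good’ news. 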
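
The size of the leverage effect implied by the TARCH estimates can be illustrated by comparing the variance response to shocks of equal size and opposite sign. The sketch below takes the two ARCH coefficients as printed above (0.0047 and 0.107); the function name and the shock size of 2 per cent are illustrative assumptions.

```python
def tarch_news_response(shock, alpha=0.0047, gamma=0.107):
    """Contribution of the lagged shock to the conditional variance in a TARCH model."""
    indicator = 1.0 if shock < 0 else 0.0  # I(shock < 0)
    return (alpha + gamma * indicator) * shock ** 2


good = tarch_news_response(+0.02)  # positive return shock
bad = tarch_news_response(-0.02)   # negative return shock of the same size
print(good, bad, bad / good)
```

Because the indicator term only switches on for negative shocks, the ratio of the two responses does not depend on the size of the shock, only on the two coefficients.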

The TARCH model can be extended by using the asymmetric power ARCH 
(APARCH) of Ding, Granger and Engle (1993). This model, now estimated 
with standardised-t innovations, is 


$$x_t = \underset{(0.00006)}{0.00039} + \varepsilon_t + \underset{(0.008)}{[\ldots]}\,\varepsilon_{t-1}$$

$$\sigma_t^{1.14} = \underset{(0.000021)}{0.000057} + \underset{(0.008)}{0.100}\left(\left|\varepsilon_{t-1}\right| [\ldots]\,\varepsilon_{t-1}\right)^{1.14} + \underset{(0.091)}{0.616}\,\sigma_{t-1}^{1.14} + \underset{(0.085)}{0.294}\,\sigma_{t-2}^{1.14}$$

Distribution parameter = 6.24 (0.26),  L = 57554.2 

The standard error attached to $\hat{\lambda} = 1.14$ is 0.07, so that the traditional 
conditional variance model having $\lambda = 2$ is certainly rejected, although the 
conditional standard deviation model with $\lambda = 1$ (see (5.8)) is marginally 
acceptable. Nevertheless, the leverage effect remains significant. 

5.6 Some models related to ARCH 


5.6.1 Simple and exponential moving averages 


An ad hoc modelling approach that is popular amongst practitioners for 
predicting short-term variance is based on smoothing methods using 
moving averages of historical squared returns. Simple moving averages 
cannot easily capture volatility clustering and require the selection of an 
arbitrary window length. The exponential moving average is more popular, 
and has been shown by Boudoukh, Richardson and Whitelaw (1997) to 
perform well in empirical applications. A variance forecast can be derived by 
using the simple exponential recursive formula with smoothing parameter, 
or decay factor, $\lambda$: 

$$\hat{\sigma}_t^2 = \lambda\hat{\sigma}_{t-1}^2 + (1-\lambda)\varepsilon_{t-1}^2$$

The size of the decay factor determines the relative weight assigned to 
more recent observations. The larger (smaller) the factor is, the smaller (larger) 
the weight given to recent observations; recall the analysis of chapter 4, 
section 1.2. The exponential moving average is the preferred method of 
volatility forecasting within the RiskMetrics risk management approach (see 
Mina and Xiao, 2001, for a comprehensive description) and is, of course, a 
special case of an IGARCH model. The proponents of this approach suggest 
that values $\lambda = 0.94$ and $0.97$ seem to work well for predicting volatility at a 
daily and monthly interval, respectively, reflecting the high persistence found 
in many volatility measures. 
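
A minimal sketch of the exponential moving average recursion is given below, in which the forecast is a weighted average of the previous forecast (weight $\lambda$) and the latest squared return (weight $1-\lambda$). The function name and the choice of start-up value are assumptions made for illustration; they are not taken from the RiskMetrics documentation.

```python
def ewma_variance(returns, lam=0.94, initial_variance=None):
    """Variance forecasts s2[t] = lam * s2[t-1] + (1 - lam) * r[t-1]**2."""
    if not 0.0 < lam < 1.0:
        raise ValueError("the decay factor must lie strictly between 0 and 1")
    if initial_variance is None:
        # Assumption: seed the recursion with the mean squared return.
        initial_variance = sum(r * r for r in returns) / len(returns)
    forecasts = [initial_variance]
    for r in returns:
        forecasts.append(lam * forecasts[-1] + (1.0 - lam) * r * r)
    return forecasts


f = ewma_variance([0.01, -0.02], lam=0.94, initial_variance=0.0001)
print(f)
```

Unwinding the recursion shows that the squared return observed $k$ periods ago receives weight $(1-\lambda)\lambda^{k}$, so a decay factor close to one spreads the weight over a long history.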

Adaptive exponential moving average models have also been proposed to 
allow for variation in the smoothing parameter. Taylor (2004) uses a logistic 
function of some user-defined variable(s) as an adaptive smoothing par- 
ameter. This smooth transition exponential smoothing (STES) variance 
estimator can be formulated as 


$$\hat{\sigma}_t^2 = \lambda_{t-1}\varepsilon_{t-1}^2 + (1-\lambda_{t-1})\hat{\sigma}_{t-1}^2$$

where 

$$\lambda_{t-1} = \left(1 + \exp(\beta + \gamma V_{t-1})\right)^{-1}$$

The smoothing parameter changes between zero and one according to the 
variations in the transition variable $V_{t-1}$. By analogy to non-linear GARCH 
models, Taylor proposes using both actual and absolute values of the lagged 
residual as transition variables. 

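He also notes the analogy between STES and the smooth transition 
GARCH(1,1) model proposed by Anderson, Nam, and Vahid (1999): 

$$\sigma_t^2 = (1 - f(\varepsilon_{t-1}))(\alpha_0 + \alpha_1\varepsilon_{t-1}^2 + \beta_1\sigma_{t-1}^2) + f(\varepsilon_{t-1})(\gamma_0 + \gamma_1\varepsilon_{t-1}^2 + \delta_1\sigma_{t-1}^2)$$

In this model the transition is governed by the logistic function 

$$f(\varepsilon_{t-1}) = \left(1 + \exp(-\theta\varepsilon_{t-1})\right)^{-1}$$

The STES model is a constrained formulation with $\alpha_0 = \alpha_1 = 0$, $\beta_1 = 1$, 
$\gamma_0 = 0$, $\gamma_1 = 1$ and $\delta_1 = 0$. Taylor (2004) proposes estimating the parameters 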
of the STES model by minimising the sum of squared deviations between 
realised and forecast volatility, and presents empirical evidence that it per- 
forms well in capturing the dynamics of the variance when compared to a 
variety of GARCH models and simple exponential moving averages. 
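
The sense in which STES is a constrained smooth transition GARCH(1,1) can be verified numerically. The sketch below assumes the lagged residual is the transition variable, writes each GARCH regime as a (constant, ARCH, GARCH) triple and uses $\theta$ for the transition slope; all function and argument names are hypothetical.

```python
import math


def st_garch_step(eps_prev, s2_prev, regime0, regime1, theta):
    """One step of a smooth transition GARCH(1,1) with logistic weight f."""
    f = 1.0 / (1.0 + math.exp(-theta * eps_prev))
    c0, a0, b0 = regime0
    c1, a1, b1 = regime1
    first = c0 + a0 * eps_prev ** 2 + b0 * s2_prev
    second = c1 + a1 * eps_prev ** 2 + b1 * s2_prev
    return (1.0 - f) * first + f * second


def stes_step(eps_prev, s2_prev, beta, gamma):
    """One STES step: adaptive weight on the squared residual."""
    lam = 1.0 / (1.0 + math.exp(beta + gamma * eps_prev))
    return lam * eps_prev ** 2 + (1.0 - lam) * s2_prev


# First regime reduces to the lagged variance, second to the squared residual.
a = st_garch_step(0.01, 0.0002, (0.0, 0.0, 1.0), (0.0, 1.0, 0.0), theta=50.0)
b = stes_step(0.01, 0.0002, beta=0.0, gamma=-50.0)
print(a, b)
```

With these restrictions the logistic weight plays the role of the adaptive smoothing parameter, so the two recursions coincide when $\beta = 0$ and $\gamma = -\theta$.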


5.6.2 Autoregressive conditional duration models 


The rapid advances in information technology during the 1990s both 
enabled and motivated the collection of financial data at very high sampling 
frequencies. These developments opened up a new area in empirical finance, 
that of high-frequency or tick data analysis (see the review by Goodhart and 
O’Hara, 1997, and the contributions in Lequeux, 1999). From a broader 
point of view, as discussed by Granger (1998) and Engle (2000), the avail- 
ability of huge high-frequency data sets is changing the science and practice 
of statistics, econometrics and decision making in many different ways. The 
time series properties of high-frequency data have been investigated with 
respect to a variety of features, including market microstructures, auto- 
correlations in mean and variance, intradaily seasonalities, long memory, 
interaction between market variables (volume, volatility, returns, trading 
frequency, spreads, etc.), dynamic and contemporaneous linkages between 
markets, forecastability and non-linearities. 

An interesting direction concerns the development of econometric models 
to describe the behaviour of time series that are sampled at irregular inter- 
vals. In these, the time between sampling intervals is treated as a random 
variable rather than assumed to be fixed. A major application concerns 
modelling intertemporally correlated variations in price duration — i.e. in the 
time between successive quote changes or transactions. Let $t_i$ be the $i$th 
transaction time with $0 = t_0 < t_1 < \cdots < t_T$. The duration between 
transactions can then be represented by $x_i = t_i - t_{i-1}$. Let the expectation of the 
$i$th duration, conditional on all past durations and a parameter vector $\theta$, be 
represented by 

$$\psi_i = \psi(x_i \mid x_{i-1}, \ldots, x_1; \theta) = E(x_i \mid x_{i-1}, \ldots, x_1), \qquad x_i/\psi_i = u_i$$

where $u_i$ is an iid non-negative process of disturbances with a given 
parametric density distribution $p(u, \phi)$. This distribution is related to the hazard 
function, or baseline hazard, which is given by the probability density of $u_i$ 
divided by the survival function of $u_i$, the latter being one minus the 
cumulative distribution function. 
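
The definition of the baseline hazard can be illustrated with the two distributions most often used for the disturbances. The sketch below, with illustrative function names and unit scale assumed throughout, shows that the exponential has a constant hazard while the Weibull with shape $k$ has hazard $k u^{k-1}$.

```python
import math


def hazard(pdf, cdf, u):
    """Hazard = density / survival, where survival = 1 - CDF."""
    return pdf(u) / (1.0 - cdf(u))


def exp_pdf(u):
    return math.exp(-u)


def exp_cdf(u):
    return 1.0 - math.exp(-u)


def weibull_pdf(u, k):
    return k * u ** (k - 1) * math.exp(-(u ** k))


def weibull_cdf(u, k):
    return 1.0 - math.exp(-(u ** k))


print(hazard(exp_pdf, exp_cdf, 0.3), hazard(exp_pdf, exp_cdf, 2.0))
print(hazard(lambda u: weibull_pdf(u, 0.8), lambda u: weibull_cdf(u, 0.8), 1.5))
```

A Weibull shape parameter below one therefore gives a decreasing hazard: the longer the market has gone without a trade, the less likely one is to occur in the next instant.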

Engle and Russell (1997, 1998) have proposed the autoregressive 
conditional duration, or ACD(p,q), process to model the dynamics of the 
conditional duration process and to predict how long it will be until prices change: 

$$\psi_i = \alpha_0 + \sum_{m=1}^{p}\alpha_m x_{i-m} + \sum_{n=1}^{q}\beta_n\psi_{i-n}$$


Although Engle and Russell (1998) use exponential and Weibull distribu- 
tions for the residuals, other distributions, such as the generalised gamma, 
the log-logistic and the log-normal, are also relevant. Since the distribution 
of $u_i$ is fully specified, maximum likelihood is applicable for estimation and 
inference purposes. As shown by Engle and Russell, under certain conditions 
the conditional ML estimates will be asymptotically normally distributed. It 
is interesting to note that, although the ACD is formulated in transaction 
time, it models the frequency and distribution of the calendar time between 
events. 
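
For the ACD(1,1) case the recursion and the exponential log-likelihood are short enough to write out. The following is a sketch under stated assumptions: the recursion is started at the sample mean duration, the disturbances are unit exponential, and the function names are ours.

```python
import math


def acd11_filter(durations, a0, a1, b1):
    """Expected durations psi[i] = a0 + a1 * x[i-1] + b1 * psi[i-1]."""
    # Assumption: start the recursion at the sample mean duration.
    psi = [sum(durations) / len(durations)]
    for x_prev in durations[:-1]:
        psi.append(a0 + a1 * x_prev + b1 * psi[-1])
    return psi


def exponential_acd_loglik(durations, a0, a1, b1):
    """Log-likelihood when x[i] / psi[i] is iid unit exponential."""
    psi = acd11_filter(durations, a0, a1, b1)
    return -sum(math.log(p) + x / p for x, p in zip(durations, psi))


sample = [2.0, 1.0, 3.0]  # hypothetical durations, e.g. in seconds
print(acd11_filter(sample, 0.1, 0.2, 0.7))
print(exponential_acd_loglik(sample, 0.1, 0.2, 0.7))
```

In practice the log-likelihood would be passed to a numerical optimiser, with positivity restrictions on the parameters so that every $\psi_i$ remains positive.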

The ACD is closely related to the GARCH, since it models the conditional 
mean duration as a function of lagged durations and their conditional 
expectations. As with the variance process and GARCH models, the ACD 
attempts to exploit the fact that transactions are not homogeneously dis- 
tributed through time but tend to cluster in a stochastic manner. Since the 
ACD can also be shown to have an ARMA representation, forecasts can be 
obtained using the standard ARMA approach. The ACD modelling frame- 
work can be combined with GARCH models to allow for dynamics in both 
the conditional duration and variance process, respectively (e.g. see Engle, 
2000, and Grammig and Wellner, 2002). Note that standard GARCH soft- 
ware and routines can be applied to obtain consistent estimates of the ACD 
model parameters using QML. In this setting, the square root of the duration 
is the dependent variable while the conditional mean equation is zero. 
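
The reason GARCH software can be used in this way is that the Gaussian log-likelihood of a zero-mean series $\sqrt{x_i}$ with conditional variance $\psi_i$ is an increasing linear function of the exponential ACD log-likelihood, so both are maximised by the same parameters. The sketch below checks this identity numerically for arbitrary positive values; the data and function names are hypothetical.

```python
import math


def gaussian_loglik_zero_mean(y, variances):
    """Gaussian log-likelihood of y with zero mean and the given variances."""
    return -0.5 * sum(
        math.log(2.0 * math.pi) + math.log(h) + v * v / h
        for v, h in zip(y, variances)
    )


def exponential_loglik(x, psi):
    """Exponential ACD log-likelihood for durations x with expectations psi."""
    return -sum(math.log(p) + d / p for d, p in zip(x, psi))


durations = [2.0, 1.0, 3.0, 0.5]
psi = [1.5, 1.8, 1.4, 1.9]  # any positive conditional expected durations
roots = [math.sqrt(d) for d in durations]

lhs = gaussian_loglik_zero_mean(roots, psi)
rhs = 0.5 * exponential_loglik(durations, psi) - 0.5 * len(durations) * math.log(2.0 * math.pi)
print(lhs, rhs)
```

Since the squared dependent variable is the duration itself, the GARCH(1,1) variance recursion fitted to $\sqrt{x_i}$ is exactly the ACD(1,1) recursion for $\psi_i$.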

A number of extensions to the ACD models have been proposed that are 
analogous to the various modifications of the GARCH model. For example, 
Bauwens and Giot (2003) consider a logarithmic transformation of the ACD 
model that enables the non-negativity restriction on $\psi_i$ to be relaxed. This 
model avoids the overprediction of duration, originally noted by Engle and 
Russell (1998), by allowing for a concave shocks impact curve. Zhang, 
Russell and Tsay (2001) develop a non-linear threshold ACD (TACD) model 
with different regimes for the persistence, conditional means and error 
distributions, respectively. More flexible hazard rate function specifications 
have also been considered. Following Hentschel (1995), Fernandes and 
Grammig (2006) derive a general class of augmented ACD (AACD) models 
by using a Box—Cox transformation of the conditional duration. The AACD 
class is shown to include most of the ACD specifications that have been 
proposed as special cases. 

Modelling via ACD processes is typically followed by simple diagnostic 
procedures to evaluate whether the residuals are iid. Fernandes and Gram- 
mig (2005) develop a testing procedure for the distribution of the error term 
in ACD models. In the first step, the ACD model is estimated and consistent 
estimates of the errors are obtained. In the second step, non-parametric and 
parametric estimates of the baseline density and hazard rate functions are 
compared. In addition to the fact that ACD models allow a more careful 
examination of microstructure theories and a more accurate estimation of 
conditional volatility, these models have found application in risk manage- 
ment (Giot, 2000) and option pricing (Prigent, Renault and Scaillet, 2001). 


5.6.3 Modelling higher moments of the conditional distribution 


ARMA-GARCH models are able to capture dependencies in the mean and 
variance and can produce non-Gaussian unconditional distributions that are 
asymmetric and fat-tailed. Although the first two moments are permitted to 
change through time, the shape of the underlying distributions remains 
constant. Since finance theory and empirical evidence both suggest that 
variations in higher moments are potentially important for portfolio opti- 
misation and risk management, there has been some effort put into mod- 
elling time variation in skewness and kurtosis. 

A direct approach involves using an autoregressive conditional moments 
(ARCM) model, which imposes dynamics directly on higher moments (see 
Harvey and Siddique, 2000, and Brooks et al., 2005). The estimation of 
ARCM models can be extremely burdensome on computational resources, 
however, with the result that modelling both skewness and kurtosis is not 
straightforward. Gallant, Rossi and Tauchen (1992) employ a semi-non- 
parametric approach based on a series expansion of the Gaussian density. 
This allows the density to be expressed as a polynomial in the past history of 
the time series. This parameterisation is not parsimonious, however, and 
large data sets are necessary in order to achieve a reasonable level of accuracy. 
The implementation of the model is also computationally expensive and 
involves the arbitrary determination of the order of expansion. 

To overcome these problems, Hansen (1994) has proposed using a 
GARCH model in which the shape parameters of a closed-form distribution 
vary according to conditioning variables. Under this autoregressive condi- 
tional density (ARCD) model, the standardised density of the residuals 
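follows a generalised skewed t-distribution: 

$$g(\varepsilon_t \mid \eta, \lambda) = \begin{cases} bc\left(1 + \dfrac{1}{\eta-2}\left(\dfrac{b\varepsilon_t + a}{1-\lambda}\right)^2\right)^{-(\eta+1)/2} & \text{if } \varepsilon_t < -a/b \\[2ex] bc\left(1 + \dfrac{1}{\eta-2}\left(\dfrac{b\varepsilon_t + a}{1+\lambda}\right)^2\right)^{-(\eta+1)/2} & \text{if } \varepsilon_t \ge -a/b \end{cases}$$

where 

$$a = 4\lambda c\left(\frac{\eta-2}{\eta-1}\right), \qquad b^2 = 1 + 3\lambda^2 - a^2, \qquad c = \frac{\Gamma\left((\eta+1)/2\right)}{\sqrt{\pi(\eta-2)}\,\Gamma(\eta/2)}$$

Skewness and kurtosis are governed by $\lambda \in (-1, 1)$ and the 
degrees-of-freedom parameter $\eta \in (2, \infty)$, respectively. Although the distribution is 
parsimonious, it can produce a rich variety of asymmetric and fat-tailed shapes. If 
$\lambda$ is positive (negative) then the variable is skewed to the right (left). When 
$\lambda = 0$ the density collapses to a standardised t, while a normal distribution is 
obtained when, in addition, $\eta \to \infty$. 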
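
The density is easy to evaluate directly, and a crude numerical integration confirms that the constants $a$, $b$ and $c$ standardise it to zero mean and unit variance. The sketch below is a plain transcription of the formula for Hansen's (1994) skewed t; the integration range, step size and function names are assumptions chosen for illustration.

```python
import math


def hansen_skewed_t_pdf(z, eta, lam):
    """Density of Hansen's skewed t for eta > 2 and -1 < lam < 1."""
    c = math.exp(math.lgamma((eta + 1) / 2) - math.lgamma(eta / 2))
    c /= math.sqrt(math.pi * (eta - 2))
    a = 4 * lam * c * (eta - 2) / (eta - 1)
    b = math.sqrt(1 + 3 * lam ** 2 - a ** 2)
    scale = 1 - lam if z < -a / b else 1 + lam
    core = 1 + ((b * z + a) / scale) ** 2 / (eta - 2)
    return b * c * core ** (-(eta + 1) / 2)


def moments(eta, lam, lo=-60.0, hi=60.0, step=0.001):
    """Riemann-sum approximations to total mass, mean and variance."""
    n = int(round((hi - lo) / step))
    mass = mean = second = 0.0
    for i in range(n + 1):
        z = lo + i * step
        p = hansen_skewed_t_pdf(z, eta, lam) * step
        mass += p
        mean += z * p
        second += z * z * p
    return mass, mean, second - mean ** 2


print(moments(5.0, 0.3))
```

Setting `lam` to zero makes the density symmetric about zero, while a positive value puts more mass in the right tail, in line with the description above.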

In the ARCD model, shape parameters are allowed to vary through time 
according to functions of lagged error terms. Hansen (1994) uses a logistic 
transformation to satisfy the boundary constraints on the conditional 
skewness and kurtosis coefficients. Although the shape parameters are sta- 
tionary, they are allowed to assume extreme values, which is useful for 
capturing jump behaviour in empirical data. Jondeau and Rockinger (2003) 
derive analytical formulae for skewness and kurtosis under ARCD in terms of 
the parameters of the generalised t-distribution. The third and fourth 
moments are shown to exist if $\eta$ exceeds three and four, respectively. They 
also determine the largest possible domain of values for the shape parameters 
for which a density exists. Since a parametric density is assumed for the 
standardised errors, maximum likelihood estimation is possible and estimates will 
be asymptotically normal. Jondeau and Rockinger propose constrained 
optimisation via a sequential quadratic programming algorithm to avoid the 
instabilities that arise in parameter estimation. 


5.7 The forecasting performance of alternative volatility models 


It is evident from this chapter that substantial efforts have been put into the 
development, specification and estimation of volatility models. Naturally, 
the issue of forecasting is also important, since volatility figures prominently 
in a variety of applications in investment, portfolio management, asset 
pricing, risk management and monetary policy. A large literature has 
appeared over recent years on investigating which model is superior in terms 
of predictive power and why. Poon and Granger (2003) review some ninety- 
three papers that appeared over two decades on this subject, and conclude 
that implied volatility estimated from options data appears to provide the 
most reliable forecasts as a wider information set is used. GARCH models 
generally rank second, often having comparable performance to that of 
simple volatility forecasts based on smoothing filters, especially for series that 
are likely to contain nonstationarities. 

Poon and Granger (2003) emphasise that the issue of evaluating volatility 
forecasts is complicated by a number of factors. First, the latent nature of 
volatility means that it is not clear what is to be predicted. Most of the early 
research concentrated on the ability of models to predict squared daily 
returns. As shown by Hansen and Lunde (2005, 2006a), if this approach is 
used in the evaluation of ARCH models then it is likely to result in an 
inconsistent ranking. More recent studies in forecast evaluation tend to 
adopt realised variance as a more accurate proxy of the volatility process. 
Even this is far from perfect, however, since, at very high frequencies, market 
microstructures may significantly influence the results (see Hansen and 
Lunde, 2006b). 

Model performance appears to vary across different markets and forecast 
horizons. Extensive recent empirical evidence by Pong et al. (2004) suggests 
that forecasts of realised volatility produced by ARMA models estimated 
over intradaily data are more accurate than implied volatilities only at daily 
and weekly forecast horizons. The incremental value of information in high- 
frequency data over implied volatilities becomes increasingly less important 
as the forecasting horizon increases. Unlike Koopman, Jungbacker and 
Hol (2005), Pong et al. (2004) find that ARFIMA models estimated with 
intradaily data do not outperform their ARMA counterparts. Ghysels, Santa- 
Clara and Valkanov (2006) have proposed a promising approach using 
mixed data sampling (MIDAS) regressions in order to assess the forecasting 
performance of volatility models, as such regressions are able to compare 
forecasting models that differ in terms of the measure of volatility, the 
sampling frequency and the lag structure. 

Another significant problem facing empirical studies concerns the selec- 
tion of an appropriate cost function and evaluation method. For a review of 
the alternative methods, see Diebold and Lopez (1996) and Poon and 
Granger (2003). Many studies are limited to a comparison between simple 
measures such as MSE and MAE (mean absolute error), and, as noted by 
Poon and Granger, even such simple comparisons are confounded by the 
typically very wide confidence intervals arising from the leptokurtic nature of 
the data. Moreover, results may vary depending on whether the variance or 
the standard deviation is used as a benchmark in the cost function. 
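
A small constructed example shows how the ranking of two forecasts can reverse when the loss is computed on standard deviations rather than variances. The realised values and the two forecast series below are hypothetical numbers chosen purely to illustrate the point.

```python
import math


def mse(actual, forecast):
    return sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual)


def mae(actual, forecast):
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)


def to_sd(values):
    return [math.sqrt(v) for v in values]


# Hypothetical realised variances and two competing variance forecasts.
realised = [1.0, 1.0, 16.0]
model_a = [1.0, 1.0, 9.0]    # right in quiet periods, understates the spike
model_b = [4.0, 4.0, 16.0]   # overstates quiet periods, right on the spike

print("variance scale:", mse(realised, model_a), mse(realised, model_b))
print("std dev scale: ", mse(to_sd(realised), to_sd(model_a)),
      mse(to_sd(realised), to_sd(model_b)))
```

On the variance scale the large miss during the spike dominates and the second forecast is preferred; on the standard deviation scale the spike is compressed and the first forecast wins, under both MSE and MAE.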

Despite the large number of papers that have evaluated the predictive 
value of various models, Poon and Granger (2003) argue that research in this 
area is still in its infancy, pointing towards several interesting areas of 
investigation involving combinations of forecasts, exploiting additional 
regressors, modelling mean absolute deviations and dealing with structural 
breaks. 

6 Univariate non-linear stochastic models: further models and testing procedures 


As discussed in chapter 2, the Wold decomposition theorem allows us to 
write every weakly stationary, purely non-deterministic stochastic process 
as a linear combination of a sequence of uncorrelated random variables: 


$$x_t - \mu = a_t + \psi_1 a_{t-1} + \psi_2 a_{t-2} + \cdots = \sum_{j=0}^{\infty}\psi_j a_{t-j}, \qquad \psi_0 = 1$$


A weakly stationary, purely non-deterministic stochastic process can be 
considered non-linear if it does not satisfy the assumptions underlying (2.1). 
In particular, a more general ‘non-linear’ representation can be obtained as 


$$x_t - \mu = f(a_t, a_{t-1}, a_{t-2}, \ldots) \qquad (6.1)$$


where $f(\cdot)$ is some arbitrary non-linear function. The ‘curse of 
dimensionality’ means that this representation is of little practical use, however. 

Allowing for regularity conditions, consider a Taylor expansion of (6.1) 
around $a_t = 0$: 

$$x_t - \mu = f(0, a_{t-1}, a_{t-2}, \ldots) + a_t f'(0, a_{t-1}, a_{t-2}, \ldots) + 0.5a_t^2 f''(0, a_{t-1}, a_{t-2}, \ldots) + \cdots$$

where $f'$ and $f''$ are the first and second derivatives of $f$ with respect to $a_t$. By 
dropping higher-order terms, we can express $x_t$ in terms of its conditional 
moments. For example, by keeping only the first two terms, $x_t$ can be 
expressed as a function of the conditional mean and variance, respectively. 
Simple forms of non-linearity can be obtained by assuming some low-order 
polynomial function f; for example, the first-order non-linear moving 
average (see Robinson, 1977): 


$$x_t = a_t + \psi_1 a_{t-1}^2$$

Polynomial functions of lagged $x_t$ can also be used (Jones, 1978), while 
another simple way of introducing non-linearity is to allow $x_t$ to respond in 
a different manner to innovations depending on their sign. For example, 
Wecker (1981) has introduced the asymmetric MA(1) process, 

$$x_t = a_t + \theta^{+}a_{t-1}^{+} - \theta^{-}a_{t-1}^{-}$$

where $\theta^{+}$ and $\theta^{-}$ are the coefficients on positive and negative innovations, 
$a_{t-1}^{+}$ and $a_{t-1}^{-}$, respectively. A wide 
variety of non-linear models have been developed that allow for combin- 
ations of AR and MA terms and for deterministic or stochastic variations in 
their parameters through time. The most popular of these models are now 
described in subsequent sections. 


6.1 Bilinear and related models 


6.1.1 The bilinear process 


An important class of non-linear model is the bilinear, which takes the 
general form 

$$\phi(B)(x_t - \mu) = \theta(B)\varepsilon_t + \sum_{i=1}^{R}\sum_{j=1}^{S}\gamma_{ij}x_{t-i}\varepsilon_{t-j} \qquad (6.2)$$

where $\varepsilon_t \sim SWN(0, \sigma_\varepsilon^2)$. The second term on the right-hand side of (6.2) is a 
bilinear form in $\varepsilon_{t-j}$ and $x_{t-i}$, and this accounts for the non-linear character 
of the model: if all the $\gamma_{ij}$ are zero, (6.2) reduces to the familiar ARMA model. 
The bilinear model can be thought of as a higher-order Taylor expansion of 
the unknown non-linear function f(-) underlying the time series dynamics 
than that provided by the Wold decomposition. 
Little analysis has been carried out on this general bilinear form, but 
Granger and Andersen (1978) have analysed the properties of several simple 
bilinear forms, characterised as 


$$x_t = \varepsilon_t + \gamma_{ij}x_{t-i}\varepsilon_{t-j}$$


If $i > j$ the model is called superdiagonal, if $i = j$ it is diagonal and if $i < j$ it is 
subdiagonal. If we define $\lambda = \gamma_{ij}\sigma_\varepsilon$ then, for superdiagonal models, $x_t$ has zero 
mean and variance $\sigma_\varepsilon^2/(1 - \lambda^2)$, so that $|\lambda| < 1$ is a necessary condition for 
stability. Conventional identification techniques using the SACF of $x_t$ would 
identify this series as white noise, but Granger and Andersen show that, in 
theory at least, the SACF of the squares of $x_t$ would identify $x_t^2$ as an 
ARMA($i$, $j$) process, so that we could distinguish between white noise and this 
bilinear model by analysing $x_t^2$. 
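
A short simulation illustrates the point for the superdiagonal case $i = 2$, $j = 1$. The sketch below assumes Gaussian innovations with unit variance and uses only the Python standard library; the function names, seed and sample size are arbitrary choices.

```python
import random


def simulate_superdiagonal(n, gamma, sigma=1.0, seed=0, burn=500):
    """Simulate x[t] = e[t] + gamma * x[t-2] * e[t-1] with Gaussian e[t]."""
    rng = random.Random(seed)
    x = [0.0, 0.0]
    e_prev = 0.0
    for _ in range(n + burn):
        e = rng.gauss(0.0, sigma)
        x.append(e + gamma * x[-2] * e_prev)  # x[-2] is x[t-2] before the append
        e_prev = e
    return x[-n:]


def sample_autocorr(series, lag):
    """Sample autocorrelation at the given lag."""
    n = len(series)
    mean = sum(series) / n
    dev = [v - mean for v in series]
    denom = sum(d * d for d in dev)
    return sum(dev[t] * dev[t - lag] for t in range(lag, n)) / denom


x = simulate_superdiagonal(20000, gamma=0.5, seed=1)
squares = [v * v for v in x]
print("variance:", sum(squares) / len(x))
print("SACF of levels, lag 1: ", sample_autocorr(x, 1))
print("SACF of squares, lag 1:", sample_autocorr(squares, 1))
```

With $\lambda = 0.5$ the sample variance should be close to $1/(1 - 0.25)$, the levels should look like white noise and the squares should be clearly autocorrelated.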

Diagonal models will also be stationary if $|\lambda| < 1$. If $i = j = 1$, $x_t$ will be 
identified as MA(1), with $0 < \rho_1 < 0.1547$ (corresponding to $\lambda = \pm 0.605$), 
while $x_t^2$ will be identified as ARMA(1,1). If $x_t$ actually is MA(1), however, 
then $x_t^2$ will also be MA(1), so that this result allows the bilinear model to be 
distinguished from the linear model. In general, the levels of a diagonal 
model will be identified as MA($i$). Subdiagonal models are essentially similar 
to superdiagonal models in that they appear to be white noise but generally 
have $x_t^2$ following an ARMA($i$, $j$) process. 

Charemza, Lifshits and Makarova (2005) discuss non-stationary 
generalisations of bilinear models that allow for unit roots. For example, they 
consider the following simple model: 

$$x_t = (a + b\varepsilon_{t-1})x_{t-1} + \varepsilon_t \qquad (6.3)$$

As shown by Granger and Andersen (1978), this process will be stationary if 
$a^2 + b^2\sigma_\varepsilon^2 < 1$. The process collapses to a random walk if $a = 1$ and $b = 0$. If 
we assume that $b$ differs from zero, however, while $a$ equals one, we can 
express the process in first differences as 

$$\Delta x_t = bx_{t-1}\varepsilon_{t-1} + \varepsilon_t \qquad (6.4)$$

Assuming $x_0 = \varepsilon_0 = 0$, it can be shown that $E(x_t) = b\sigma_\varepsilon^2(t-1)$ and 
$E(\Delta x_t) = b\sigma_\varepsilon^2$. This implies that we should expect a positive sign for $b$ in 
empirical applications. The variance of the differenced process is (Charemza, 
Lifshits and Makarova, 2005, appendix A) 

$$V(\Delta x_t) = \left(5\sigma_\varepsilon^2 + b^2E(\varepsilon_t^4)\right)\left(1 + b^2\sigma_\varepsilon^2\right)^{t-2} - 4tb^2\sigma_\varepsilon^4 + 7b^2\sigma_\varepsilon^4 - 4\sigma_\varepsilon^2$$

Although the process is capable of producing mean-reverting behaviour, it is 
evident that it does not retain the desirable difference stationarity property of 
the random walk. The fact that the process allows predictability imposes a 
significant limitation with respect to its theoretical validity within standard 
financial theories. 
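
The variance expression can be sanity-checked at small $t$, where it can be worked out by hand. Reading it as $V(\Delta x_t) = (5\sigma_\varepsilon^2 + b^2E(\varepsilon_t^4))(1 + b^2\sigma_\varepsilon^2)^{t-2} - 4tb^2\sigma_\varepsilon^4 + 7b^2\sigma_\varepsilon^4 - 4\sigma_\varepsilon^2$, the sketch below compares it with the direct results for $t = 2$ and $t = 3$ under the assumption of symmetric errors. The parameter values and the function name are illustrative.

```python
def urb_diff_variance(t, b, sigma2, m4):
    """V(dx_t) for dx_t = b * x[t-1] * e[t-1] + e[t], x_0 = e_0 = 0, t >= 2."""
    growth = (5 * sigma2 + b ** 2 * m4) * (1 + b ** 2 * sigma2) ** (t - 2)
    return growth - 4 * t * b ** 2 * sigma2 ** 2 + 7 * b ** 2 * sigma2 ** 2 - 4 * sigma2


b, sigma2 = 0.1, 1.0
m4 = 3 * sigma2 ** 2  # fourth moment under Gaussian errors (an assumption)

# Hand-derived benchmarks for symmetric errors:
# t = 2: dx_2 = b * e_1**2 + e_2
by_hand_2 = sigma2 + b ** 2 * (m4 - sigma2 ** 2)
# t = 3: dx_3 = b * (e_1 + b * e_1**2 + e_2) * e_2 + e_3
by_hand_3 = sigma2 + b ** 2 * m4 + b ** 4 * m4 * sigma2

print(urb_diff_variance(2, b, sigma2, m4), by_hand_2)
print(urb_diff_variance(3, b, sigma2, m4), by_hand_3)
print([round(urb_diff_variance(t, b, sigma2, m4), 3) for t in (10, 50, 200)])
```

The last line shows the variance of the differenced process increasing with $t$, which is the loss of difference stationarity referred to in the text.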

When a=1, (6.3) can be considered to be a special case of the more 
general process 


$$x_t = \varphi_t x_{t-1} + \varepsilon_t$$

where $\varphi_t$ is a random autoregressive coefficient with $E(\varphi_t) = 1$. This unit root 
bilinear (URB) model resembles the random coefficient autoregressive 
process (Leybourne, McCabe and Mills, 1996) and the stochastic unit root 
processes (Granger and Swanson, 1997) discussed in chapter 3, section 7. 
The non-stationary bilinear process explicitly relates the unit root dynamics 
to the lagged residuals, however. Charemza, Lifshits and Makarova (2005) 
develop a simple t-ratio-type test for detecting bilinearity in a unit root 
process. 

For small values of $b < 1/\sqrt{T}$, we can reasonably assume that $\Delta x_t \approx \varepsilon_t$ 
and the test regression can be formulated as 


$$\Delta x_t = bx_{t-1}\Delta x_{t-1} + u_t$$


The test statistic is simply the t-statistic of b in this regression estimated via 
OLS. Under the null of no bilinearity — i.e. a= 1 and b= 0 — this test statistic 
is asymptotically normally distributed. The test regression can be augmented 
by a constant, drift or further autoregressive components in a straightfor- 
ward manner by just adding the relevant terms. Charemza, Lifshits and 
Makarova suggest a two-step procedure: first test for a unit root and then 
test for bilinearity. This is consistent, in the sense that the size of the unit 
root test is not affected by the possible detection of bilinearity in the second 
step. Charemza, Lifshits and Makarova put forward theoretical arguments 
and empirical evidence that support the usefulness of URB processes in 
finance. 
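
The second step of this procedure amounts to a single regression without a constant. The sketch below computes the t-ratio of $b$ in the regression of $\Delta x_t$ on $x_{t-1}\Delta x_{t-1}$ by OLS and applies it to a simulated series; it is an illustration of the test regression only, not the authors' code, and the simulation design and function names are assumptions.

```python
import math
import random


def urb_t_statistic(x):
    """t-ratio of b in dx[t] = b * x[t-1] * dx[t-1] + u[t], estimated by OLS."""
    dx = [x[t] - x[t - 1] for t in range(1, len(x))]
    # dx[k] is the change ending at x[k + 1]; its regressor is x[k] * dx[k - 1].
    y = dx[1:]
    z = [x[k] * dx[k - 1] for k in range(1, len(dx))]
    szz = sum(v * v for v in z)
    b_hat = sum(zi * yi for zi, yi in zip(z, y)) / szz
    resid = [yi - b_hat * zi for zi, yi in zip(z, y)]
    s2 = sum(r * r for r in resid) / (len(y) - 1)
    return b_hat / math.sqrt(s2 / szz)


def simulate_urb(n, b, seed=0):
    """Simulate x[t] = (1 + b * e[t-1]) * x[t-1] + e[t] with x_0 = e_0 = 0."""
    rng = random.Random(seed)
    x, e_prev = [0.0], 0.0
    for _ in range(n):
        e = rng.gauss(0.0, 1.0)
        x.append((1.0 + b * e_prev) * x[-1] + e)
        e_prev = e
    return x


t_null = urb_t_statistic(simulate_urb(500, b=0.0, seed=1))  # pure random walk
print(t_null)
```

Under the null the statistic is compared with standard normal critical values; a constant or further lags would be added as extra regressors in the same way.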

Detailed analysis of the properties of bilinear models can be found in 
Granger and Andersen (1978), Subba Rao (1981), Subba Rao and Gabr 
(1984), Guégan (1987) and Pham (1993). Most of the results are of con- 
siderable theoretical interest but are of little relevance in practice: for 
example, most of the conditions for stationarity and invertibility are too 
complicated to be used as constraints on the parameters in actual models. 


6.1.2 A comparison of ARCH and bilinearity 


Weiss (1986b) provides a detailed comparison of the ARMA-ARCH model, 
given by equations (5.13) and (5.14), and the bilinear model (6.2). At first 
sight, the models appear quite different: whereas the addition of the ARCH 
equation to the pure ARMA process (5.13) introduces non-linearity by 
affecting the conditional variance, the addition of the bilinear terms 
contained in (6.2) changes the conditional mean of $x_t$. Weiss argues that, despite 
these different influences, the two processes can have similar properties, and, 
for example, the bilinear process may be mistaken for an ARMA model with 
ARCH errors. 

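Why might this be? Suppose the true model for $x_t$ is (6.2) but the ARMA 
model 

$$\tilde{\phi}(B)(x_t - \tilde{\mu}) = \tilde{\theta}(B)\tilde{\varepsilon}_t$$

is fitted. The residual, $\tilde{\varepsilon}_t$, is given by 

$$\tilde{\varepsilon}_t = \vartheta_1(B)\varepsilon_t + \vartheta_2(B)\sum_{i=1}^{R}\sum_{j=1}^{S}\gamma_{ij}x_{t-i}\varepsilon_{t-j}$$

where $\vartheta_1(B) = \phi^{-1}(B)\tilde{\theta}^{-1}(B)\tilde{\phi}(B)\theta(B)$ and $\vartheta_2(B) = \phi^{-1}(B)\tilde{\theta}^{-1}(B)\tilde{\phi}(B)$. On 
squaring this expression and taking conditional expectations, it is clear that 
$E(\tilde{\varepsilon}_t^2 \mid x_{t-1}, x_{t-2}, \ldots)$ is not constant but will be a function of lagged $\tilde{\varepsilon}_t^2$, 
and hence may be thought to have ARCH. For example, suppose the true 
model is 

$$x_t = \varepsilon_t + \gamma_{21}x_{t-2}\varepsilon_{t-1} \qquad (6.5)$$

As $E(x_t) = 0$ and $E(x_t x_{t+i}) = 0$, $i > 0$, the use of traditional modelling 
techniques may identify the trivial ARMA model $x_t = \tilde{\varepsilon}_t$, where 

$$\tilde{\varepsilon}_t = \varepsilon_t + \gamma_{21}\tilde{\varepsilon}_{t-2}\varepsilon_{t-1}$$

Squaring this and taking expectations gives 

$$E(\tilde{\varepsilon}_t^2 \mid x_{t-1}, x_{t-2}, \ldots) = \sigma_\varepsilon^2 + \gamma_{21}^2\sigma_\varepsilon^2\tilde{\varepsilon}_{t-2}^2$$

Now, the LM statistic for testing whether $\tilde{\varepsilon}_t$ is ARCH(1) is $T\cdot R^2$ from the 
regression of $\tilde{\varepsilon}_t^2$ on a constant and $\tilde{\varepsilon}_{t-1}^2$; given the above expectation, such a 
statistic may well be large even if the correct model is really the bilinear 
process (6.2). 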
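
This confusion is easy to reproduce. The sketch below simulates the bilinear process $x_t = \varepsilon_t + \gamma_{21}x_{t-2}\varepsilon_{t-1}$ with no ARCH at all and computes the $T\cdot R^2$ statistic from the regression of $x_t^2$ on a constant and $x_{t-1}^2$. Gaussian innovations, $\gamma_{21} = 0.5$, the seed and the function names are illustrative assumptions.

```python
import random


def simulate_bilinear(n, gamma, seed=0, burn=500):
    """Simulate x[t] = e[t] + gamma * x[t-2] * e[t-1] with N(0, 1) innovations."""
    rng = random.Random(seed)
    x = [0.0, 0.0]
    e_prev = 0.0
    for _ in range(n + burn):
        e = rng.gauss(0.0, 1.0)
        x.append(e + gamma * x[-2] * e_prev)
        e_prev = e
    return x[-n:]


def arch1_lm_statistic(resid):
    """T * R^2 from regressing resid[t]**2 on a constant and resid[t-1]**2."""
    sq = [r * r for r in resid]
    y, z = sq[1:], sq[:-1]
    n = len(y)
    my, mz = sum(y) / n, sum(z) / n
    szz = sum((v - mz) ** 2 for v in z)
    syy = sum((v - my) ** 2 for v in y)
    szy = sum((a - mz) * (c - my) for a, c in zip(z, y))
    r2 = szy * szy / (szz * syy)  # R^2 of a simple regression with intercept
    return n * r2


x = simulate_bilinear(5000, gamma=0.5, seed=2)
lm = arch1_lm_statistic(x)
print(lm, lm > 3.841)  # 3.841 is the 5% critical value of chi-squared(1)
```

The statistic should reject the null of no ARCH decisively, even though the innovations are homoskedastic and all of the non-linearity is in the conditional mean.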

The correct LM statistic for testing $x_t = \tilde{\varepsilon}_t$ against the bilinear alternative 
(6.5) is, in fact, $T\cdot R^2$ from the regression of $\tilde{\varepsilon}_t$ on a constant, $\tilde{\varepsilon}_{t-1}$ and $\tilde{\varepsilon}_{t-1}\tilde{\varepsilon}_{t-2}$. 
In general, if $\phi(B)$ and $\theta(B)$ in (6.2) are of orders $P$ and $Q$ respectively, 
then the LM statistic for testing (6.2) against the simple linear ARMA 
specification (5.13) is $T\cdot R^2$ from the regression of $\tilde{\varepsilon}_t$ on a constant, 
$x_{t-1}, \ldots, x_{t-P}$, $\tilde{\varepsilon}_{t-1}, \ldots, \tilde{\varepsilon}_{t-Q}$, and $x_{t-i}\tilde{\varepsilon}_{t-j}$, $i = 1, \ldots, R$, $j = 1, \ldots, S$; the 
statistic is distributed as $\chi^2_{RS}$. Weiss shows, however, that such a test will not 
have the correct size if, in fact, ARCH is present as well; nor, indeed, will the 
LM test for ARCH have the correct size if bilinearity is present. 

Weiss (1986b) shows that LS and ML estimates of the bilinear model 
(5.19) coincide. Although estimation of a bilinear model is straightforward, 
identification of that model can pose difficulties, particularly when, as we 
have seen, both bilinearity and ARCH are present and one can be confused 
with the other. 

Weiss thus considers the combined bilinear model with ARCH errors — i.e. 
the bilinear process (6.2) with the ARCH specification (5.14). The identifi- 
cation of this model is based on the relative difficulties introduced by the 
different specification errors. First, ignoring bilinearity can lead to residuals 
appearing to have ARCH even though they may not be autocorrelated. On 
the other hand, misspecifying the ARCH will affect the variance of a process 
but not the specification of the mean equation. Given the greater complexity 
of bilinear models and the difficulties faced in their specification, this sug- 
gests that it is easier to mistake bilinearity for ARCH than vice versa. Weiss 
thus suggests that the bilinear model should be specified before ARCH is 
considered explicitly. 

The suggested procedure is to use the SACFs of $x_t^2$, $\tilde{\varepsilon}_t$ and $\tilde{\varepsilon}_t^2$ and 
associated LM tests to specify the bilinear process after a pure ARMA model has 
been identified and fitted by conventional techniques. The SACFs, which do 
not allow for ARCH, will suggest possible bilinear specifications or extra 
bilinear terms, and the formal tests, which do allow for ARCH, can then be 
used to determine which specifications are appropriate. Because we wish to 
test bilinearity in the possible presence of ARCH, however, the LM test, 
although not requiring the actual form of ARCH, nevertheless does not have 
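a $T\cdot R^2$ representation; the exact form, derived in Weiss (1986b), is 

$$\left(\sum\tilde{\varepsilon}_t\frac{\partial\varepsilon_t}{\partial\Lambda}\right)'\left(\sum\tilde{\varepsilon}_t^2\frac{\partial\varepsilon_t}{\partial\Lambda}\frac{\partial\varepsilon_t}{\partial\Lambda'}\right)^{-1}\left(\sum\tilde{\varepsilon}_t\frac{\partial\varepsilon_t}{\partial\Lambda}\right)$$

where $\Lambda$ contains both the ARMA and bilinear parameters. 
Once the bilinearity has been determined, the ARCH equation can be 
specified using the ACF of the squared residuals obtained from the 
estimation of the bilinear model. Estimation of the combined model then follows, 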
and overfitting and LM tests for extra ARCH or bilinear parameters can be 
undertaken. 

Since the LM test for bilinearity in the presence of ARCH does not have 
the usual $T\cdot R^2$ form, and because the subsequent ARCH test requires first 
estimating a bilinear model, this procedure is rather burdensome if we just 
want a simple test for non-linearity that is sensitive to both ARCH and 
bilinear alternatives. Higgins and Bera (1988) thus propose an easily com- 
puted simultaneous test for a joint ARCH and bilinear alternative. This is an 
LM test whose construction exploits the result that the individual LM tests 
for ARCH and bilinearity are additive: the joint test statistic is thus the sum 
of the individual test statistics. Moreover, because the two forms of non- 
linearity are considered simultaneously, the LM test for bilinearity again has 
the standard $T \cdot R^2$ representation, being the test outlined above. Hence, the 
combined test statistic will be distributed as $\chi^2_{RS+p}$. 

Maravall (1983) considers an alternative form of bilinearity in which $x_t$ is 
given by the ARMA process 

$$\phi(B)(x_t - \mu) = \theta(B) a_t$$

but where the uncorrelated sequence $\{a_t\}$ is bilinear in $a_t$ and the strict 
white-noise sequence $\{\varepsilon_t\}$: 

$$a_t = \varepsilon_t + \sum_{i=1}^{R}\sum_{j=1}^{S} \gamma_{ij}\, a_{t-i}\,\varepsilon_{t-j}$$


This may be interpreted as a bilinear model ‘forecasting white noise’. 

Giraitis and Surgailis (2002) describe a general class of fractional bilinear 
models that exhibit long-run dependence in both conditional mean and 
variance. These models have the general form 


$$x_t = \varepsilon_t\left(\gamma_0 + \sum_{i=1}^{\infty}\gamma_i x_{t-i}\right) + \alpha_0 + \sum_{i=1}^{\infty}\alpha_i x_{t-i}$$

The AR($\infty$) model is obtained when $\gamma_i = 0$ for $i > 0$ and the linear ARCH 
(LARCH) model when $\alpha_i = 0$ for $i > 0$. This bilinear model can be shown to 
be equivalent to an ARCH($\infty$) process when $\gamma_0 = 0$. Depending on the rate 
of decay of the autocorrelations for $x_t$ and $x_t^2$, the process is dominated by 
long memory in the mean or variance, giving rise to either an ARFIMA or 
LARCH model, respectively. 

Kristensen (2005) has recently established necessary and sufficient con- 
ditions for the stationarity of various ARCH models that can be written as 
subdiagonal bilinear models, e.g. GARCH, power GARCH, EGARCH, etc. 
The sufficient conditions employ the Lyapunov exponent of the model and 
are weaker than those established previously in the literature. For example, 
consider the following bilinear model: 


$$x_t = (a + b\varepsilon_t)x_{t-1} = a x_{t-1} + b\varepsilon_t x_{t-1}$$


A sufficient condition for stationarity is that the Lyapunov exponent 
$E(\log|a + b\varepsilon_t|)$ is less than zero. This becomes a necessary condition if we 
can further assume model irreducibility — something that is not true for the 
model given above (see Kristensen, 2005). 
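
The Lyapunov condition lends itself to a quick numerical check. The following Python sketch estimates $E(\log|a + b\varepsilon_t|)$ by Monte Carlo, assuming (this is not specified in the text) that $\varepsilon_t$ is standard normal; the function name `lyapunov_exponent` and the parameter pairs are purely illustrative.

```python
import numpy as np

def lyapunov_exponent(a, b, n_draws=200_000, seed=0):
    """Monte Carlo estimate of E[log|a + b*eps|], assuming eps ~ N(0, 1)."""
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal(n_draws)
    return float(np.mean(np.log(np.abs(a + b * eps))))

# Hypothetical parameter pairs: a negative estimate is consistent with
# the sufficient condition for stationarity, a positive one is not.
for a, b in [(0.5, 0.5), (1.0, 1.0), (2.0, 2.0)]:
    print(a, b, lyapunov_exponent(a, b))
```

Under the normality assumption the estimate can be negative even when $a = 1$, because the multiplicative shock pulls $|a + b\varepsilon_t|$ below one often enough.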

How useful are bilinear models in modelling financial time series? 
De Gooijer (1989), among others, presents evidence to suggest that such 
processes can provide useful models for certain daily stock return series, 
although the residual variances of the bilinear models are usually only 
marginally smaller than those obtained from alternative linear models. 


Example 6.1 Is the dollar/sterling exchange rate bilinear? 

Given the above discussion, is it possible that the GARCH model fitted to 
the dollar/sterling exchange rate in example 5.5 is a misspecification and the 
true process generating the series is of bilinear form? An obvious way to 
proceed is to consider the SACFs and PACFs of the differences and squared 
differences. Recall that in example 5.5 it was found that the SACF of $\hat\varepsilon_t = \Delta x_t$ 
was consistent with a white-noise process. For $\hat\varepsilon_t^2$, all the first twelve 
sample autocorrelations are significant, as are the first eleven partial 
autocorrelations, which suggests that an ARMA(1, 1) process could be 
appropriate. This pair of findings is consistent with a diagonal bilinear model with 
$R = S = 1$. The LM test for such bilinearity, obtained from regressing $\hat\varepsilon_t$ on 
$\hat\varepsilon_{t-1}$ and $\hat\varepsilon_{t-1}^2$, produced a $T \cdot R^2$ of just 1.58, distributed as $\chi^2_1$, thus indicating 
no evidence in favour of bilinearity. Of course, this statistic is strictly valid 
only in the absence of ARCH, which we know exists. Construction of the 
ARCH-adjusted statistic produced a value of only 0.48, however, confirming 
the absence of bilinearity. 
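
The unadjusted $T \cdot R^2$ statistic used in this example can be sketched in Python as follows. This is a minimal illustration rather than the authors' code: it assumes NumPy, it includes a constant in the auxiliary regression (an assumption), and the helper names `t_r_squared` and `bilinear_lm_first_order` are hypothetical. The ARCH-adjusted form of Weiss (1986b) is not implemented here.

```python
import numpy as np

def t_r_squared(y, X):
    """T * R^2 from an OLS regression of y on X (X includes a constant)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    tss = np.sum((y - y.mean()) ** 2)
    r2 = 1.0 - (resid @ resid) / tss
    return len(y) * r2

def bilinear_lm_first_order(e):
    """Regress e_t on a constant, e_{t-1} and e_{t-1}^2; return T * R^2."""
    e = np.asarray(e, dtype=float)
    y, lag = e[1:], e[:-1]
    X = np.column_stack([np.ones_like(lag), lag, lag ** 2])
    return t_r_squared(y, X)

# Illustrative use on simulated white noise (not the exchange rate data)
rng = np.random.default_rng(0)
print(bilinear_lm_first_order(rng.standard_normal(500)))
```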


Example 6.2 Modelling IBM stock prices as a combined bilinear and ARCH process 

Figure 6.1 IBM common stock price (daily from 17 May 1961) 

The daily closing price for IBM common stock for the 169 trading days 
starting 17 May 1961, presented as part of series B in Box and Jenkins (1976) 
and plotted as figure 6.1, has been investigated by various researchers (see, 
in particular, Weiss, 1986b). Conventional (linear) identification procedures 
suggest that the differences of the series, denoted $x_t$, follow an MA(1) 
process, least squares (LS) estimation of which yields 

$$x_t = \varepsilon_t - 0.26\varepsilon_{t-1}, \quad \hat\sigma^2 = 24.8, \quad r_{1,\varepsilon} = -0.02, \quad r_{1,\varepsilon^2} = 0.18$$


An LM test for ARCH(1) errors, ignoring the possibility of bilinearity, 
yields a $T \cdot R^2$ statistic of 15.1, which is clearly significant (note that this 
confirms the evidence of non-linearity provided by the lag one autocorrelation 
of $\hat\varepsilon_t^2$). Tests for first-order diagonal bilinearity yield values of 7.9 
and 8.1, the latter being corrected for first-order ARCH. Incorporating an 
ARCH(1) error specification yields, on ML estimation, 


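$$x_t = \varepsilon_t - 0.24\varepsilon_{t-1}, \quad \hat\sigma^2 = 24.8, \quad r_{1,\varepsilon} = 0.01, \quad r_{1,\varepsilon^2} = 0.02$$

$$\sigma_t^2 = \underset{(2.6)}{17.9} + \underset{(0.12)}{0.28}\,\varepsilon_{t-1}^2$$

and then including bilinear terms produces either 

$$x_t = \varepsilon_t - \underset{(0.08)}{0.24}\,\varepsilon_{t-1} + \underset{(0.010)}{0.023}\,x_{t-1}\varepsilon_{t-1}, \quad \hat\sigma^2 = 23.7$$

or 

$$x_t = a_t - \underset{(0.08)}{0.23}\,a_{t-1}, \quad \hat\sigma^2 = 23.7$$

$$a_t = \varepsilon_t + \underset{(0.010)}{0.023}\,a_{t-1}\varepsilon_{t-1}$$

$$\sigma_t^2 = \underset{(2.6)}{18.3} + \underset{(0.11)}{0.23}\,\varepsilon_{t-1}^2$$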


The bilinear terms are significant and their introduction decreases the 
estimate of the ARCH parameter somewhat. Note that the mean equation for 
the second model can be written as 


$$x_t = \varepsilon_t - 0.23\varepsilon_{t-1} + 0.023\,x_{t-1}\varepsilon_{t-1} + 0.005(a_{t-2}\varepsilon_{t-1} - a_{t-2}\varepsilon_{t-2})$$


so it is no surprise that the fits of the two models are virtually identical. The 
same specification is also arrived at if a bilinear process is first fitted after tests 
for bilinearity on the original MA(1) model are performed. 
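
The rewriting of the second model's mean equation can be checked by direct substitution, using only its two equations $x_t = a_t - 0.23a_{t-1}$ and $a_t = \varepsilon_t + 0.023a_{t-1}\varepsilon_{t-1}$ (coefficients rounded as in the text):

```latex
\begin{aligned}
x_t &= a_t - 0.23\,a_{t-1} \\
    &= \varepsilon_t + 0.023\,a_{t-1}\varepsilon_{t-1}
       - 0.23\left(\varepsilon_{t-1} + 0.023\,a_{t-2}\varepsilon_{t-2}\right) \\
    &= \varepsilon_t - 0.23\,\varepsilon_{t-1}
       + 0.023\left(x_{t-1} + 0.23\,a_{t-2}\right)\varepsilon_{t-1}
       - 0.023 \times 0.23\,a_{t-2}\varepsilon_{t-2}
       \qquad \text{since } a_{t-1} = x_{t-1} + 0.23\,a_{t-2} \\
    &= \varepsilon_t - 0.23\,\varepsilon_{t-1} + 0.023\,x_{t-1}\varepsilon_{t-1}
       + 0.00529\left(a_{t-2}\varepsilon_{t-1} - a_{t-2}\varepsilon_{t-2}\right)
\end{aligned}
```

The final coefficient, $0.023 \times 0.23 = 0.00529$, rounds to 0.005, so the second model differs from the first only by a term with a very small coefficient.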


6.1.3 State-dependent and related models 


Using the concept of ‘Volterra expansions’, Priestley (1980, 1988) shows that 
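a general relationship between $x_t$ and $\varepsilon_t$ can be represented as 

$$x_t = f(x_{t-1}, \ldots, x_{t-p}, \varepsilon_{t-1}, \ldots, \varepsilon_{t-q}) \qquad (6.6)$$

If $f(\cdot)$ is assumed analytic, the right-hand side of (6.6) can be expanded in a 
Taylor series about an arbitrary but fixed time point, allowing 
the relationship to be written as the state-dependent model (SDM) of order 
$(p, q)$ 

$$x_t - \sum_{i=1}^{p}\phi_i(\mathbf{x}_{t-1})\,x_{t-i} = \mu(\mathbf{x}_{t-1}) + \sum_{i=1}^{q}\theta_i(\mathbf{x}_{t-1})\,\varepsilon_{t-i} \qquad (6.7)$$

where $\mathbf{x}_t$ denotes the state vector 

$$\mathbf{x}_t = (x_t, \ldots, x_{t-p+1}, \varepsilon_t, \ldots, \varepsilon_{t-q+1})$$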


Priestley (1980, p.54) remarks that this model has the interpretation of a 
locally linear ARMA model in which the evolution of the process at time 
$t - 1$ is governed by a set of AR coefficients, $\{\phi_i(\cdot)\}$, a set of MA coefficients, 
$\{\theta_i(\cdot)\}$, and a local ‘mean’, $\mu(\cdot)$, all of which depend on the ‘state’ of the 
process at time $t - 1$. 

If $\mu(\cdot)$, $\{\phi_i(\cdot)\}$ and $\{\theta_i(\cdot)\}$ are all taken as constants, i.e. as independent of 
$\mathbf{x}_{t-1}$, (6.6) reduces to the usual ARMA(p, q) model. Moreover, if only $\mu(\cdot)$ 
and $\{\phi_i(\cdot)\}$ are taken as constants but we set 


$$\theta_i(\mathbf{x}_{t-1}) = \theta_i + \sum_{j=1}^{p}\gamma_{ij}\,x_{t-j}, \qquad i = 1, \ldots, q$$

then the SDM reduces to the bilinear model (6.2), with $R = p$ and $S = q$. 

The SDM class of non-linear models can also be shown to include the 
threshold AR model (Tong and Lim, 1980), the exponential AR model 
(Haggan and Ozaki, 1981) and various other non-linear specifications that 
have been developed over recent years; for example, a non-linear AR(1) 
model could be 


$$x_t = \phi_1 x_{t-1} + \exp(-\gamma x_{t-1}^2)\,\phi_2 x_{t-1} + \varepsilon_t$$


Haggan, Heravi and Priestley (1984) provide an extensive study of the 
application of SDMs to a wide variety of non-linear time series, although 
they use no financial, or even economic, data. 


6.2 Regime-switching models: Markov chains 
and smooth transition autoregressions 


An alternative way of introducing asymmetry is to consider ‘regime-switching’ 
models. We consider here two of the most popular of these: 
the two-state Markov model and the smooth transition autoregression. 
Hamilton (1989, 1990), Engel and Hamilton (1990) and Lam (1990) propose 
variants of a regime-switching Markov model, which can be regarded as a 
non-linear extension of an ARMA process that can accommodate complicated 
dynamics such as asymmetry and conditional heteroskedasticity. The set-up 
is that of the UC model developed in chapter 4, section 1, i.e. 


$$x_t = z_t + u_t \qquad (6.8)$$

where, again, $z_t$ is a non-stationary random walk component, but where its 
drift now evolves according to a two-state Markov process: 

$$z_t = \mu(S_t) + z_{t-1} = \alpha_0 + \alpha_1 S_t + z_{t-1} \qquad (6.9)$$

where 

$$\begin{aligned}
P(S_t = 1 \mid S_{t-1} = 1) &= p \\
P(S_t = 0 \mid S_{t-1} = 1) &= 1 - p \\
P(S_t = 1 \mid S_{t-1} = 0) &= 1 - q \\
P(S_t = 0 \mid S_{t-1} = 0) &= q
\end{aligned}$$

The component $u_t$ is assumed to follow an AR($r$) process 

$$\phi(B)u_t = \varepsilon_t \qquad (6.10)$$


where the innovation sequence $\{\varepsilon_t\}$ is strict white noise, but $\phi(B)$ is allowed 
to contain a unit root so that, unlike the conventional UC specification, $u_t$ 
can be non-stationary. In fact, a special case of the conventional UC model 
results when $p = 1 - q$; the random walk component then has an innovation 
restricted to being a two-point random variable, taking the values zero and 
one with probabilities $q$ and $1 - q$, respectively, rather than a zero-mean 
random variable drawn from a continuous distribution such as the normal. 

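The stochastic process for $S_t$ is strictly stationary, having the AR(1) 
representation 

$$S_t = (1 - q) + \lambda S_{t-1} + V_t$$

where $\lambda = p + q - 1$ and where the innovation $V_t$ has the conditional 
probability distribution 

$$\begin{aligned}
P(V_t = (1 - p) \mid S_{t-1} = 1) &= p \\
P(V_t = -p \mid S_{t-1} = 1) &= 1 - p \\
P(V_t = -(1 - q) \mid S_{t-1} = 0) &= q \\
P(V_t = q \mid S_{t-1} = 0) &= 1 - q
\end{aligned}$$

This innovation is uncorrelated with lagged values of $S_t$, since 

$$E(V_t \mid S_{t-j} = 1) = E(V_t \mid S_{t-j} = 0) = 0 \quad \text{for } j \ge 1$$

but it is not independent of such lagged values as, for example, 

$$\begin{aligned}
E(V_t^2 \mid S_{t-1} = 1) &= p(1 - p) \\
E(V_t^2 \mid S_{t-1} = 0) &= q(1 - q)
\end{aligned}$$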

The variance of the Markov process can be shown to be 

$$\alpha_1^2\,\frac{(1 - p)(1 - q)}{(2 - p - q)^2}$$

As this variance approaches zero, i.e. as $p$ and $q$ approach unity, so the 
random walk component (6.9) approaches a deterministic trend. If $\phi(B)$ 
contains no unit roots, $x_t$ will thus approach a TS process, whereas, if $\phi(B)$ 
does contain a unit root, $x_t$ approaches a DS process. 
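
The properties of the state process can be illustrated by simulation. The following Python sketch, which is an illustration under stated assumptions rather than part of the original analysis, draws a long path of $S_t$ from the transition probabilities given above and compares the sample mean, variance and lag-one autocorrelation of $S_t$ itself (i.e. before scaling by $\alpha_1$) with $(1-q)/(2-p-q)$, $(1-p)(1-q)/(2-p-q)^2$ and $\lambda = p + q - 1$. The function name `simulate_states` and the values of $p$ and $q$ are hypothetical.

```python
import numpy as np

def simulate_states(p, q, n, seed=0):
    """Simulate S_t with P(1|1) = p and P(0|0) = q, started from the
    stationary distribution P(S = 1) = (1 - q) / (2 - p - q)."""
    rng = np.random.default_rng(seed)
    u = rng.random(n)
    s = np.empty(n, dtype=int)
    s[0] = int(u[0] < (1 - q) / (2 - p - q))
    for t in range(1, n):
        prob_one = p if s[t - 1] == 1 else 1 - q
        s[t] = int(u[t] < prob_one)
    return s

p, q = 0.9, 0.8                      # illustrative values only
s = simulate_states(p, q, 200_000)
pi1 = (1 - q) / (2 - p - q)
print("mean:", s.mean(), "theory:", pi1)
print("variance:", s.var(), "theory:", (1 - p) * (1 - q) / (2 - p - q) ** 2)
print("lag-1 autocorrelation:", np.corrcoef(s[1:], s[:-1])[0, 1],
      "theory:", p + q - 1)
```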

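Given $\{x_t\}$, ML estimates of the model are obtained by first expressing 
(6.8) as 

$$u_t - u_{t-1} = x_t - x_{t-1} - \alpha_0 - \alpha_1 S_t$$

and solving backwards in time to yield 

$$u_t = x_t - x_0 - \alpha_0 t - \alpha_1\sum_{i=1}^{t} S_i + u_0 \qquad (6.11)$$

Using (6.10) and (6.11), the innovations $\varepsilon_t$ can be expressed as 

$$\varepsilon_t = \phi(B)(x_t - x_0 - \alpha_0 t) + \phi(1)u_0 - \alpha_1\phi(1)\sum_{i=1}^{t} S_i + \alpha_1\sum_{j=1}^{r}\left(\sum_{k=j}^{r}\phi_k\right)S_{t-j+1}$$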


Assuming that the innovations are normal, this expression can be utilised to 
calculate the log-likelihood function on noting that this can be decomposed 
as the sum of the conditional (on past observations) log-likelihoods. These 
conditional log-likelihoods depend on unobserved current and past real- 
isations of the Markov states. A recursive relationship can be shown to hold 
between the conditional distribution of the states and the conditional like- 
lihood of the observations, and this can be exploited to obtain an algorithm 
for evaluating the log-likelihood function. Inferences about the unobserved 
components and states are then obtained as by-products of this evaluation: 
details of the algorithm may be found in Hamilton (1989) and Lam (1990). 

The Markov approach assumes that the process can shift randomly and 
abruptly from one regime to the other. An alternative is to consider a process 
in which the transition from one regime to the other occurs only once and in 
a smooth fashion. We have already encountered a model of this type in 
chapter 3, section 6, the logistic smooth transition regression, or LSTR, trend 
model, and this idea is easily extended to smooth transition AR models, 
termed STAR models by Teräsvirta (1994). The logistic STAR, or LSTAR($p$), 
model is defined as 

$$x_t = \pi_{10} + \sum_{i=1}^{p}\pi_{1i}x_{t-i} + \left(\pi_{20} + \sum_{i=1}^{p}\pi_{2i}x_{t-i}\right)\cdot S_{t,d}(\gamma, c) + u_t$$

where $S_{t,d}(\gamma, c) = (1 + \exp(-\gamma(x_{t-d} - c)))^{-1}$ is the smooth transition. 

An alternative model replaces $S_{t,d}(\gamma, c)$ with 
$S^*_{t,d}(\gamma^*, c^*) = (1 - \exp(-\gamma^*(x_{t-d} - c^*)^2))$, which is known as the exponential STAR (ESTAR) model. 
Either model can be estimated by NLS for a given value of the delay 
parameter $d$, although, as Teräsvirta (1994) and van Dijk, Teräsvirta and Franses 
(2002) discuss, obtaining convergence and accurately estimating the 
‘smoothing’ parameter, $\gamma$ or $\gamma^*$, is not always easy. 
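
The two transition functions are simple to evaluate directly, which helps in seeing why the smoothing parameter can be hard to pin down: once it is large, the logistic function is close to a step and further increases barely change the fit. The Python sketch below assumes NumPy; the function names and the parameter values in the usage lines are illustrative only.

```python
import numpy as np

def logistic_transition(x_lag, gamma, c):
    """LSTAR transition: rises monotonically from 0 to 1 as x_lag passes c."""
    return 1.0 / (1.0 + np.exp(-gamma * (x_lag - c)))

def exponential_transition(x_lag, gamma, c):
    """ESTAR transition: 0 at x_lag = c, approaching 1 as |x_lag - c| grows."""
    return 1.0 - np.exp(-gamma * (x_lag - c) ** 2)

grid = np.linspace(-1.0, 1.0, 9)
print(logistic_transition(grid, gamma=2.0, c=0.0))    # gradual switch
print(logistic_transition(grid, gamma=50.0, c=0.0))   # close to a step at c
print(exponential_transition(grid, gamma=2.0, c=0.0))
```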


Example 6.3 Are there long swings in the dollar/sterling exchange rate? 
In this example, inspired by Engel and Hamilton (1990), we fit a two-state 
Markov process to quarterly observations on the dollar/sterling exchange 
rate from 1972 to 1996 inclusive and, in the spirit of Engel and Hamilton, ask 
whether the series is characterised by ‘long swings’, i.e. a sequence of 
stochastic segmented trends. 

This exchange rate is close to being a driftless random walk, so that the 
differences are approximately white noise, but not strict white noise, as they 
are conditionally heteroskedastic. We therefore fitted the two-state Markov 
model, with $\phi(B) = (1 - B)$, to the series using Hamilton’s (1990) 
expectation maximisation (EM) algorithm (see also Engel and Hamilton, 1990). The 
differences are thus given by 

$$\Delta x_t = \alpha_0 + \alpha_1 S_t + \varepsilon_t$$

which can equivalently be interpreted as a model in which $\Delta x_t$ is assumed to 
be drawn from a $N(\mu_0, \sigma_0^2)$ distribution when $S_t = 0$ and a $N(\mu_1, \sigma_1^2)$ 
distribution when $S_t = 1$, where $\mu_0 = \alpha_0$ and $\mu_1 = \alpha_0 + \alpha_1$. 
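
For this two-regime normal model, the recursion that evaluates the log-likelihood is short enough to sketch. The Python code below is a minimal illustration, not Hamilton's EM algorithm: it evaluates the log-likelihood and the filtered (not smoothed) regime probabilities for given parameter values, starting from the stationary distribution of the chain. It follows the transition probabilities as defined earlier in this section, with $p = P(S_t = 1 \mid S_{t-1} = 1)$ and $q = P(S_t = 0 \mid S_{t-1} = 0)$; the function name `hamilton_filter` is hypothetical.

```python
import numpy as np

def hamilton_filter(dx, mu, sigma2, p, q):
    """Log-likelihood and filtered P(S_t = 0 | data up to t) for the model
    dx_t ~ N(mu[s], sigma2[s]) when S_t = s, s in {0, 1}."""
    mu = np.asarray(mu, dtype=float)
    sigma2 = np.asarray(sigma2, dtype=float)
    # trans[i, j] = P(S_t = j | S_{t-1} = i)
    trans = np.array([[q, 1 - q],
                      [1 - p, p]])
    # stationary probabilities (P(S = 0), P(S = 1)) used as the starting point
    prob = np.array([1 - p, 1 - q]) / (2 - p - q)
    loglik = 0.0
    filtered = np.empty((len(dx), 2))
    for t, y in enumerate(dx):
        pred = prob @ trans                           # P(S_t = j | past data)
        dens = (np.exp(-0.5 * (y - mu) ** 2 / sigma2)
                / np.sqrt(2 * np.pi * sigma2))        # regime densities of dx_t
        joint = pred * dens
        lik = joint.sum()                             # conditional likelihood
        loglik += np.log(lik)
        prob = joint / lik                            # updated state probabilities
        filtered[t] = prob
    return loglik, filtered[:, 0]
```

Maximising the returned log-likelihood over the parameters with a numerical optimiser would give ML estimates; smoothed probabilities of the kind plotted in such applications require an additional backward pass that is not shown here.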

This simple model allows a wide variety of exchange rate behaviour. For 
example, asymmetry in the persistence of the two regimes can be 
characterised by $\mu_0$ being large and positive and $p$ being small, so that upward 
moves are short and sharp, and $\mu_1$ being negative and small and $q$ being 
large, so that downward moves are drawn out and gradual. If the change in 
the exchange rate is completely independent of the previous state, then we 
have a random walk with $p = 1 - q$. The long swings hypothesis can be 
represented by $\mu_0$ and $\mu_1$ being opposite in sign and $p$ and $q$ both being large. 

The following ML estimates were obtained, with standard errors shown in 
parentheses: 

$$\hat\mu_0 = 2.605\ (0.964), \quad \hat\mu_1 = -3.277\ (1.582)$$
$$\hat p = 0.857\ (0.084), \quad \hat q = 0.866\ (0.097)$$
$$\hat\sigma_0^2 = 13.56\ (3.34), \quad \hat\sigma_1^2 = 20.82\ (4.79)$$


The estimates associate regime zero with a 2.61 per cent quarterly rise in 
sterling and regime one with a fall of 3.28 per cent. Figure 6.2 shows the levels 
of the exchange rate and a plot of the ‘smoothed’ probability that the process 
was in regime zero at each date in the sample. These smoothed probabilities 
are estimates of the probability that $S_t = 0$ conditional upon the full sample 
of observations and the ML estimates of the parameters (see Engel and 
Hamilton, 1990, for further discussion). The dates at which the exchange rate 
was in an ‘upswing’ — i.e. periods for which these smoothed probabilities are 
greater than 0.5 — are shown as shaded areas. 

These estimates show that movements in the exchange rate are indeed 
characterised by long swings, since the point estimates of p and q are both 
greater than 0.85 and those of $\mu_0$ and $\mu_1$, as we have seen, are opposite in sign. 
Hence, once the exchange rate is in a particular regime, it is likely to stay there, 
although there is an indication that such swings are shorter in the 1990s. The 
expected length of stay in regime zero is given by $(1 - p)^{-1} = 7.0$ quarters, 
while that for regime one is $(1 - q)^{-1} = 7.5$ quarters. Two hypothesis tests are 
of interest. The first is the random walk (strictly, the martingale) hypothesis 
that $p = 1 - q$, for which a Wald test, distributed asymptotically as $\chi^2_1$, yields 
the statistic 26.9, which clearly rejects the null. The second is the hypothesis 
that the mean appreciation and depreciation rates are the same, i.e. $\mu_0 = -\mu_1$. 
This produces a Wald statistic of only 0.09 and so clearly cannot be rejected. 


Example 6.4 An LSTAR model for UK gilt yields 
In this example we fit a smooth transition model to the twenty-year UK gilt 
series, R20, used to derive the spread analysed in examples 2.2, 2.4 and 3.1. As 
R20 is I(1), we analyse the differences of the series, $\Delta R20$, whose plot is 
shown in figure 6.3. The plot shows a pattern of changing variability, so there 
is certainly the potential for successfully fitting a non-linear model. Within 
the class of linear ARMA models, an AR(2) provides an adequate fit: 


$$\Delta R20_t = \underset{(0.039)}{0.313}\,\Delta R20_{t-1} - \underset{(0.039)}{0.122}\,\Delta R20_{t-2} + \varepsilon_t$$

$$\hat\sigma = 0.297, \quad Q(12) = 9.1, \quad Q^2(12) = 301$$

Figure 6.2 Dollars/sterling exchange rate (quarterly 1973-1996) and probability of being in state 0 


The $Q^2$ statistic, which is a portmanteau statistic using the squared 
residuals (see section 5.5.6), reveals considerable evidence of non-linearity. 
We thus initially fitted an LSTAR(2) model (with the delay set at d= 1), but 
this did not eliminate the non-linearity and so we extended the model to a 
combined LSTAR(2)-GARCH(1,1) process, for which estimation yields 
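
$$\Delta R20_t = \underset{(0.040)}{0.178}\,\Delta R20_{t-1} + \left(\underset{(0.222)}{0.564}\,\Delta R20_{t-1} - \underset{(0.262)}{0.501}\,\Delta R20_{t-2}\right)\cdot S_{t,1}(130, -0.780) + \varepsilon_t$$

$$\sigma_t^2 = \underset{(0.00016)}{0.00040} + \underset{(0.014)}{0.068}\,\varepsilon_{t-1}^2 + \underset{(0.012)}{0.929}\,\sigma_{t-1}^2$$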

For this model we have $\hat\sigma = 0.294$ and $Q^2(12) = 8.3$, so that the GARCH 
error process successfully removes the non-linearity, but, nevertheless, the 
smooth transition component enters significantly (the smoothness 
parameter $\gamma$ is very imprecisely estimated but, as Teräsvirta (1994) discusses, 
this is not unusual for such models). The transition is abrupt at a value 
for $\Delta R20$ of -0.78, with $S_{t,1}(\cdot)$ switching from zero to one at this point. 
When $S = 0$, which we might refer to as the ‘lower’ regime, the mean 
process for $\Delta R20$ is an AR(1) with a root of 0.18. When $S = 1$, the ‘upper’ 
regime, the AR(2) process $\Delta R20_t = 0.742\,\Delta R20_{t-1} - 0.501\,\Delta R20_{t-2}$ has 
complex roots $0.37 \pm 0.63i$ with modulus 0.73 and period of five and a half 
months. By way of contrast, the linear AR(2) model has roots $0.16 \pm 0.31i$ 
with modulus 0.35 and period of six and a half months. As usual, the 
GARCH process for the errors is almost integrated, the sum of the 
coefficients being 0.999. 


6.3 Non-parametric and neural network models 


6.3.1 Non-parametric modelling 


Recent advances in computer power have motivated the development of 
non-parametric or semi-parametric time series methods that make very few 
assumptions about the functional form of the underlying dynamic 
dependencies (see Fan and Yao, 2003). Most of these methods allow the 
reduction of observational error through the use of smoothing estimators. 
For expository purposes, we can assume the following non-linear auto- 
regressive model: 


$$Y_t = f(X_t) + \varepsilon_t = f(Y_{t-1}, Y_{t-2}, \ldots, Y_{t-p}) + \varepsilon_t$$

where $f(\cdot)$ is an arbitrary non-affine function that has to be estimated from 
the data, $Y_t$ is a stationary stochastic process and $\varepsilon_t$ is a white-noise process. 

Consider first the general case of estimating $f(\cdot)$ at a particular point in 
time for which $X_t = x$ and repeated observations $y_1, y_2, \ldots, y_N$ are available 
for $Y_t$. A natural smoothing estimator of $f(x)$ is the average of the available 
data for $Y_t$: 

$$\hat f(x) = N^{-1}\sum_{i=1}^{N} y_i = N^{-1}\sum_{i=1}^{N}\left[f(x) + \varepsilon_i\right] = f(x) + N^{-1}\sum_{i=1}^{N}\varepsilon_i$$

Large-sample theory dictates that this is a consistent estimator, since, as $N$ 
increases, the average of the white-noise term $\varepsilon_i$ will converge to zero. 
Obviously, in a time series context repeated observations for $Y_t$ are not 
available for each data point $x$. Nevertheless, assuming that $f(\cdot)$ is sufficiently 
smooth, for time series observations near $x$, $f(x)$ can be approximated by an 
average of the corresponding $Y_t$s. This is based on the reasonable assumption 
that, for $X_t$s close to $x$, $Y_t$ will also be close to $f(x)$. In order to account for the 
potentially varying distances between the $X_t$s and $x$, a local weighted average 
of $Y_t$ is appropriate: 

$$\hat f(x) = T^{-1}\sum_{t=1}^{T} w_t(x)\,y_t \qquad (6.12)$$

where the weights $w_t(x)$ depend on the proximity of the $X_t$s to $x$. 

The different ways of measuring the distance between adjacent points 
and of assigning weights to a given distance give rise to the variety of 
non-parametric methods and techniques that have been developed in the 
literature. These include kernel regression, neural networks (NNs), orthogonal 
series expansions, wavelets, projection pursuit, nearest-neighbour 
estimators, splines, non-linear state space models, regressogram approaches, 
convolution smoothing, delta function sequence estimators and median 
smoothing, among others. 

Non-parametric time series methods have apparent advantages over 
parametric alternatives, particularly in cases where little is known or can be 
inferred about the functional form underlying the dynamic dependencies. 
Unfortunately, non-parametric approaches come with significant costs and 
shortcomings. They often require substantial computational resources and 
data. Statistical inference is problematic and often depends on non-standard 
statistical theory and computationally demanding resampling techniques. 
The inherent flexibility of non-parametric models means that, under certain 
circumstances, they are prone to overfitting, where a model performs well 
within the particular sample selected for estimation but has very poor out- 
of-sample forecasting performance. 

A further limitation of non-parametric models is related to the curse of 
dimensionality, according to which the sample size requirement increases 
exponentially with the number of variables. This means that non-parametric 
estimates have consistency rates that are much slower than the typical square 
root convergence of parametric approaches. Also, outliers may cause serious 
problems, since non-parametric models tend to stretch the function space in 
order to include all available observations. Finally, the application of non- 
parametric models is complicated by the requirements for specialised soft- 
ware, the large number of tuning parameters and the variety of available 
model configurations and estimation procedures. 

The remainder of this section discusses two of the most commonly used 
non-parametric approaches: kernel regression and neural networks. 


6.3.2 Kernel regression 


One of the most widely used non-parametric approaches determines the 
weights using a kernel function, defined to have the properties 


$$K(u) \ge 0, \qquad \int K(u)\,du = 1$$

The kernel $K(u)$ is typically a probability density function. This is done for 
reasons of mathematical convenience and has no probabilistic implications. 
The kernel is usually rescaled by the bandwidth, a positive variable $h$, 
such that 

$$K_h(u) = h^{-1}K(u/h), \qquad \int K_h(u)\,du = 1$$

The weight function is then defined as 

$$w_t(x) = \frac{K_h(x - x_t)}{T^{-1}\sum_{t=1}^{T} K_h(x - x_t)} \qquad (6.13)$$


By substituting the weight function (6.13) into (6.12) we obtain the 
Nadaraya—Watson kernel estimator. A variety of alternative kernel functions 
have been proposed, with the most popular being the Epanechnikov and 
Gaussian kernels. The Nadaraya—Watson kernel estimator can be shown to 
be a local linear regression and, under regularity conditions, can be gener- 
alised to local polynomial and non-linear regressions. The choice of band- 
width may significantly affect the results obtained by kernel methods. If h 
is close to zero, only values of $X_t$ close to $x$ will be weighted highly in the 
averaging process. Conversely, if $h$ assumes an excessive value, a large 
neighbourhood of points around each $X_t$ will be averaged, and this may lead 
to oversmoothing. Although several different approaches have been 
developed for automatic bandwidth selection (see Härdle, 1990), one must 
also keep in mind the particular objective of the data analysis. For presen- 
tation and descriptive purposes an oversmoothed curve may be desirable, 
while for forecasting a slightly undersmoothed curve may be more appro- 
priate. 
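
Combining (6.12) and (6.13), the Nadaraya-Watson estimate at a point is simply a kernel-weighted average of the observed responses. The Python sketch below assumes NumPy and an Epanechnikov kernel; the function names and the simulated data in the usage lines are illustrative, and evaluation points with no observation inside the kernel's support are returned as NaN (a design choice of this sketch).

```python
import numpy as np

def epanechnikov(u):
    """Epanechnikov kernel: 0.75 * (1 - u^2) on [-1, 1], zero elsewhere."""
    return np.where(np.abs(u) <= 1, 0.75 * (1 - u ** 2), 0.0)

def nadaraya_watson(x0, x, y, h, kernel=epanechnikov):
    """Kernel-weighted average of y at each evaluation point in x0."""
    x0 = np.atleast_1d(np.asarray(x0, dtype=float))
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    k = kernel((x0[:, None] - x[None, :]) / h) / h    # K_h(x0 - x_t)
    denom = k.sum(axis=1)
    with np.errstate(invalid="ignore", divide="ignore"):
        return np.where(denom > 0, (k @ y) / denom, np.nan)

# Illustrative use: a small h tracks the data closely, a large h smooths heavily
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 200)
y = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(200)
print(nadaraya_watson([0.25, 0.75], x, y, h=0.05))
print(nadaraya_watson([0.25, 0.75], x, y, h=0.5))
```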

By estimating the function f (x) as a weighted average of the response 
variables in a varying, rather than fixed, neighbourhood of points, we obtain 
the nearest-neighbour estimator. As with kernel regression, this approach 
also fits polynomial regressions locally around each data point. The 
neighbourhood is defined as those $X_t$s that are the $k$ nearest neighbours of $x$ in 
terms of a Euclidean distance. The smoothing parameter k of the nearest- 
neighbour approach has a similar role to that of the bandwidth in the kernel 
method. It assumes values between zero and one and denotes the number kT 
of observations nearest to a given point that should be included in the 
neighbourhood and the local polynomial fit. 
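
A minimal Python sketch of a nearest-neighbour smoother with a local linear fit is given below. It assumes NumPy, a single regressor, and an unweighted least squares fit within the neighbourhood (an assumption made for brevity; weighted fits are also used in practice). The function name `knn_local_linear` is hypothetical.

```python
import numpy as np

def knn_local_linear(x0, x, y, k):
    """Local linear fit at x0 using the fraction k (0 < k <= 1) of the
    sample that lies nearest to x0."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n_nb = max(2, int(np.ceil(k * len(x))))        # size of the neighbourhood
    idx = np.argsort(np.abs(x - x0))[:n_nb]        # indices of nearest points
    X = np.column_stack([np.ones(n_nb), x[idx] - x0])
    coef, *_ = np.linalg.lstsq(X, y[idx], rcond=None)
    return float(coef[0])                          # intercept = fitted value at x0

# Illustrative use on simulated data
rng = np.random.default_rng(0)
x = np.arange(1.0, 201.0)
y = 0.05 * x + rng.standard_normal(200)
print(knn_local_linear(100.0, x, y, k=0.3))
```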

A number of techniques have been developed to reduce the computational 
burden of non-parametric estimation. For example, kernel estimators 
are typically estimated using local polynomial regressions at a subset M of 
equidistant points of the whole sample. Similarly, for the nearest-neighbour 
estimator, rather than fitting local polynomials around each point in the 
sample, the Cleveland subsampling approach provides an adaptive algorithm 
that skips adjacent points in such a way that the subsample is representative 
of all the values of the regressor. For relatively large samples, well over 100, 
the computational savings can be drastic with often insignificant loss of 
information. 

A complete treatment of computational shortcuts can be found in Härdle 
(1990). In time series models where several lags are considered, researchers 
sometimes assume an additive model and estimate separately via kernel or 
nearest-neighbour methods the relationship between the dependent variable 
and each of the right-hand variables. This assumption conveniently reduces 
the computational and data problems caused by the curse of dimensionality. 
Several techniques, such as the regression tree and projection pursuit 
regression, have been developed in order to accommodate multidimensional 
non-parametric estimation under model additivity in a time series and 
regression context (see Härdle, 1990). 

To demonstrate the application of the above methods, we simulate 200 
observations from a cubic deterministic trend process, 


$$y_t = 5 + 10^{-6}t^3 + \varepsilon_t$$


where $\varepsilon_t$ is a standard normal variate. Figure 6.4 depicts the results of 
applying kernel and nearest-neighbour smoothers in estimating the 
functional relationship between $y_t$ and $t$. For the kernel method we used a 
Nadaraya-Watson estimator with an Epanechnikov kernel function. The 
kernel bandwidth was first set according to an arbitrary rule suggested by the 
estimation software (EViews 5) based on the range of the data: $h = 0.15(y_{\max} - y_{\min})$. 
For the nearest-neighbour estimator, a value of $k = 0.3$ was 
originally selected with a local linear polynomial. In order to demonstrate the 
effects of undersmoothing, we also show the results for smaller values for h 
and k, respectively. It is evident that both estimators offer a reasonably good 
fit to the data, except for the area near the origin. 


6.3.3 Neural networks 


Artificial neural networks (ANNs), often just called neural networks, refer 
to a broad class of non-parametric models that have found much popu- 
larity in recent years across a wide spectrum of disciplines, including 
computer science, psychology, biology, linguistics, statistics, forecasting and 
pattern recognition (for a textbook treatment, see Haykin, 1999). These 
models originate from research in the cognitive sciences on emulating the 
structure and behaviour of the human brain. NNs have been applied, with 
varied levels of success, to problems in finance and econometrics (for a 
treatment from a statistical and econometric perspective, see Cheng and 
Titterington, 1994, Kuan and White, 1994, and White, 2006, and the ref- 
erences therein). 

One of the most commonly used NN variations is the feed-forward type, 
also called the multilayered perceptron (MLP), which can be used for non- 
parametric regression and classification. These models are organised in three 
basic layers: the input layer of independent variables, the output layer of 
dependent variables and one or more hidden layers in between. A transfer 
function regulates the dependencies (synapses) between the elements 
(neurons or nodes) of each layer. 

In mathematical formulation, a univariate autoregressive MLP model with 
a single hidden layer can be represented as 


$$y_t = \beta_0' z_t + \sum_{j}\beta_j\,G(\gamma_j' z_t) + \varepsilon_t \qquad (6.14)$$

where $y_t$ is the output variable, $z_t = (1, y_{t-1}, y_{t-2}, \ldots, y_{t-p})$ is the input vector 
of lagged $y_t$s, $\beta_0' z_t$ is a linear unit and the $\beta_j$ are the model parameters 
(connection weights). $G(\cdot)$ is the transfer (or activation) function with 
parameters $\gamma_j$. This is a bounded non-linear function and operates in a 
manner similar to that of the transition functions used in STAR models. A 
number of different transfer functions are employed in practice, with the 
most common being the hyperbolic tangent and the logistic. The second 
term in (6.14) refers to the hidden layer in the MLP. Obviously, (6.14) 
collapses to a standard AR(p) model when the transfer function is linear. The 
residual term $\varepsilon_t$ is usually assumed to be an iid random variable. 
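
The deterministic part of (6.14) amounts to one matrix-vector product followed by the transfer function. The Python sketch below assumes NumPy and a logistic transfer function; the function names, the array shapes and the numerical values are illustrative assumptions rather than anything estimated in the text.

```python
import numpy as np

def logistic(a):
    """Logistic transfer function, bounded between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-a))

def mlp_output(z, beta0, beta, gamma, G=logistic):
    """Deterministic part of a single-hidden-layer MLP.

    z     : input vector (1, y_{t-1}, ..., y_{t-p}), shape (p + 1,)
    beta0 : weights of the linear unit, shape (p + 1,)
    beta  : hidden-to-output weights, shape (m,)
    gamma : input-to-hidden weights, shape (m, p + 1)
    """
    hidden = G(gamma @ z)                 # one signal per hidden unit
    return float(beta0 @ z + beta @ hidden)

# Hypothetical example with p = 2 lags and m = 3 hidden units
z = np.array([1.0, 0.4, -0.2])
beta0 = np.array([0.1, 0.5, -0.3])
beta = np.array([0.8, -0.5, 0.2])
gamma = np.array([[0.0, 1.0, 0.0],
                  [0.5, -1.0, 2.0],
                  [-0.2, 0.3, 0.3]])
print(mlp_output(z, beta0, beta, gamma))
```

Setting all the hidden-to-output weights to zero leaves only the linear unit, i.e. the AR(p) part of the model.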

The basic MLP described above can easily be extended to include multiple 
output/input variables and hidden layers, respectively. An interpretation of 
the MLP mechanics can be given as follows: units in the input layer send 
signals to hidden unit $j$ over connections that are amplified by weights $\gamma_j$. The signals 
arriving at each hidden layer unit are summed and then the outcome signal of 
the hidden unit is produced using the transfer function. The weighted signals 
from the hidden layer(s) are transmitted to the output layer. The most 
celebrated property of MLPs, and, indeed, most NNs, lies in their universal 
approximation capability: they can approximate any function, under mild 
regularity conditions, to any desired degree of accuracy by increasing the 
number of units in the hidden layer (see Hornik, Stinchcombe, and White, 
1989). In the context of non-parametric regression, White (1990) proves that 
NNs can be used for the consistent estimation of any unknown square 
integrable conditional expectation function. 

The high flexibility, rich parameterisation and non-linear nature of NNs 
render estimation particularly difficult (see White, 2006). One of the main 
problems is that NNs are very susceptible to overfitting. The estimation 
strategy of NNs is rather different from traditional linear econometric model 
estimation, in that it typically involves two steps: in-sample optimisation 
(training or learning) with recurrent out-of-sample testing (cross-validation), 
and out-of-sample testing. The in-sample optimisation is usually 
terminated, prior to reaching the maximum possible performance, when the 
performance of the model in the cross-validation sample starts to 
deteriorate. In this way overfitting is avoided and a good forecasting performance 
in the testing sample is more likely. In neural network jargon, the ability of 
an NN to perform well in out-of-sample forecasting is referred to as the 
generalisation capability of the model. The estimation (training) algorithms used 
vary considerably and typically involve adjusting the parameters in the 
direction of the negative gradient of some error criterion (e.g. mean squared 
or absolute error). 

Several iterative methods have been proposed for solving this non-linear 
estimation problem, and usually these are combined with additional con- 
straints in order to ensure the smoothness of the estimated function. In the 
case of MLPs, most of these methods are based on variants of the back- 
propagation algorithm, which works backwards from the output layer and 
uses a gradient rule to vary biases and weights iteratively. The algorithm is 
sensitive to local minima in the error space and is therefore applied several 
times with different starting values. 
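
The training strategy described above (gradient-based updates, termination when 
the cross-validation error stops improving, and several restarts from different 
starting values) can be sketched as follows. This is a minimal illustration 
rather than the procedure used for the examples in this chapter: the data are a 
hypothetical noisy sine curve, and the network size, learning rate, patience and 
all function names are assumptions made purely for the sketch. 

```python
import math
import random


def init_params(hidden, rng):
    """Random starting values for a one-input MLP with one hidden layer."""
    return {
        "w1": [rng.uniform(-1, 1) for _ in range(hidden)],
        "b1": [rng.uniform(-1, 1) for _ in range(hidden)],
        "w2": [rng.uniform(-1, 1) for _ in range(hidden)],
        "b2": 0.0,
    }


def forward(p, x):
    """Return the network output and the hidden-layer activations."""
    acts = [math.tanh(w * x + b) for w, b in zip(p["w1"], p["b1"])]
    return p["b2"] + sum(w * a for w, a in zip(p["w2"], acts)), acts


def mse(p, data):
    return sum((forward(p, x)[0] - y) ** 2 for x, y in data) / len(data)


def train(train_set, val_set, hidden=3, lr=0.05, max_epochs=200,
          patience=10, seed=0):
    """Back-propagation, stopped early when the validation error deteriorates."""
    rng = random.Random(seed)
    p = init_params(hidden, rng)
    best_loss, best_epoch, best_p, waited = float("inf"), 0, None, 0
    for epoch in range(1, max_epochs + 1):
        for x, y in train_set:
            y_hat, acts = forward(p, x)
            err = y_hat - y  # gradient of 0.5 * squared error at the output
            for h in range(hidden):
                # Error propagated back from the output to hidden unit h.
                back = err * p["w2"][h] * (1.0 - acts[h] ** 2)
                p["w2"][h] -= lr * err * acts[h]
                p["w1"][h] -= lr * back * x
                p["b1"][h] -= lr * back
            p["b2"] -= lr * err
        val_loss = mse(p, val_set)
        if val_loss < best_loss:
            best_loss, best_epoch, waited = val_loss, epoch, 0
            best_p = {k: (list(v) if isinstance(v, list) else v)
                      for k, v in p.items()}
        else:
            waited += 1
            if waited >= patience:  # validation error no longer improving
                break
    return best_p, best_loss, best_epoch


# Hypothetical data: a noisy sine curve split into training and validation samples.
rng = random.Random(42)
data = [(x, math.sin(3.0 * x) + rng.gauss(0.0, 0.2))
        for x in (rng.uniform(-1.0, 1.0) for _ in range(120))]
train_set, val_set = data[:80], data[80:]

# Several restarts guard against convergence to a poor local minimum.
runs = [train(train_set, val_set, seed=s) for s in range(3)]
best_params, best_val, best_epoch = min(runs, key=lambda r: r[1])
```

A third, untouched testing sample would still be needed to assess forecasting 
performance, since the validation sample has been used to choose the stopping 
point and the preferred restart. 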

An additional pitfall in MLP estimation concerns the selection of the 
appropriate model architecture: the number of hidden layers and the 
number of neurons in each layer. One can either start with a small model and 
add hidden layers and neurons until performance is optimal, or start with an 
oversized model and prune small weights or reduce its size. Sometimes a 
preliminary optimisation is undertaken, using a genetic algorithm, simulated 
annealing or some other heuristic method, in order to select a good set of 
starting values and model architecture and to reduce the computational 
burden. Model performance is often evaluated according to parsimony 
metrics such as the BIC and AIC. 
Figure 6.5 VIX implied volatility index (daily January 1990-September 2005) 


A major problem with MLPs is their ‘black box’ property, since the par- 
ameters and structure of the model offer little intuition and conclusions can 
be drawn only implicitly via simulation or sensitivity analysis. Moreover, 
assessing the statistical significance of the parameters is problematic. 
Although MLPs have become a popular model for applied non-linear 
modelling in finance, theoretical and applied research on alternatives is also 
under way (see Al-Falou and Trummer, 2003). 


Example 6.5 Modelling the non-linear structure of the VIX 

Indices of the volatility implied by options are becoming increasingly 
popular as a measure of market uncertainty and as a vehicle for developing 
derivative instruments to hedge against unexpected changes in volatility. In 
example 5.2 we fitted a jump diffusion model to the VIX implied volatility 
index. The series, shown in figure 6.5, exhibits extreme variations and appears 
to be a good candidate for non-linear modelling. As shown by Psychoyios, 
Dotsis and Markellos (2006), the VIX is stationary, although the possibility of 
a fractional unit root cannot be excluded. 

We proceed by modelling the logarithms of the VIX using a variety of 
univariate linear and non-linear models, the performance of which is 
summarised in table 6.1. Overall, the results suggest that regime-switching 
models offer the best approximation to the data-generating process, since the 
model with a Markov switching constant and a single autoregressive lag 
produces the highest R² and the smallest BIC and AIC values. This model 
leaves some serial dependencies in the residuals, as measured by a large Q 
statistic. It manages to remove most of the leptokurtosis from the original 
data, however, leaving almost normally distributed residuals. Hamilton’s 
regime-switching mean process also performs reasonably well. 

Table 6.1 Linear and non-linear models for the VIX 

Model                R²      BIC       AIC       Skew    Kurtosis   JB      Q(12) 
ARMA(1, 1)           0.968   -5814.6   -5827.2    0.66   6.49       2298    78.5 
ARMA(4, 3)           0.969   -5831.7   -5859.9    0.77   6.65       2584    27.4 
ARFIMA(1, 0.76, 0)   0.969   -5863.6   -5876.2    0.75   6.70       2622    37.7 
Bilinear-AR(1)       0.968   -5807.4   -5819.9    0.49   6.38       2045    91.0 
Markov switching     0.974   -5920.9   -5942.9   -0.10   3.48       42.7    78.4 
Hamilton             0.973   -5851.2   -5870.0    0.07   4.16       225.8   85.2 
MLP(5:3:1)           0.969   -5778.7   -5825.8    0.78   6.55       2472    42.0 

The ARFIMA (1, d, 0) model has the second best performance in terms of 
the information criteria considered. It implies a long-memory process with a 
fractional root of d= 0.76 and a single autoregressive lag. An MLP with five 
autoregressive lags and three elements in a single hidden layer appears to 
have the worst performance in terms of the BIC and AIC. The MLP was 
estimated via cross-validation on a sample of 100 observations using the 
back-propagation algorithm and 5000 optimisation cycles. The fact that it 
has a performance very close to that of the linear ARIMA models suggests 
that the MLP may have been trapped in a local minimum or that further 
optimisation is necessary. It is interesting to note the small differences in 
performance between the linear and non-linear models considered. 

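The parameters of the estimated models, except for the MLP, are as follows: 

ARMA(1, 1) 

$$y_t = \underset{(0.061)}{-1.687} + \underset{(0.003)}{0.987}\,y_{t-1} + \varepsilon_t + \underset{(0.002)}{0.074}\,\varepsilon_{t-1}$$

ARMA(4, 3) 

$$\begin{aligned}
y_t = {} & \underset{(0.103)}{-1.715} + \underset{(0.039)}{0.118}\,y_{t-1} + \underset{(0.034)}{1.209}\,y_{t-2} + \underset{(0.029)}{0.443}\,y_{t-3} - \underset{(0.040)}{0.774}\,y_{t-4} \\
& + \varepsilon_t - \underset{(0.032)}{0.770}\,\varepsilon_{t-1} + \underset{(0.051)}{0.495}\,\varepsilon_{t-2} + \underset{(0.031)}{0.863}\,\varepsilon_{t-3}
\end{aligned}$$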
ARFIMA(1, 0.76, 0) 

$$\Delta^{0.76} y_t = \underset{(0.222)}{-1.278} + \underset{(0.034)}{0.137}\,y_{t-1} + \varepsilon_t$$


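Bilinear-AR(1) 

$$y_t = \underset{(0.728)}{-1.385} + \underset{(0.018)}{0.994}\,y_{t-1} + \underset{(0.005)}{0.189}\,y_{t-1}\varepsilon_{t-1} + \varepsilon_t$$

Markov switching AR(1) and constant 

$$y_t = \begin{cases}
\underset{(0.065)}{-1.798} + \underset{(0.002)}{0.984}\,y_{t-1} + \varepsilon_t \\[4pt]
\underset{(0.093)}{0.464} + \underset{(0.005)}{0.899}\,y_{t-1} + \varepsilon_t
\end{cases}$$

Hamilton’s regime-switching mean model 

$$y_t = \begin{cases}
\underset{(0.041)}{-0.913} + \underset{(0.002)}{0.982}\,y_{t-1} + \varepsilon_t \\[4pt]
\underset{(0.304)}{6.098} + \underset{(0.002)}{0.982}\,y_{t-1} + \varepsilon_t
\end{cases}$$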

6.4 Non-linear dynamics and chaos 


The processes introduced so far in this chapter all have in common the aim 
of modelling stochastic non-linearities in financial time series. This would 
seem a natural approach to take by those used to dealing with stochastic time 
series processes, but a literature has also developed that considers the 
question of whether such series could have been generated, at least in part, by 
non-linear deterministic laws of motion. 

Research on non-linear dynamics is concerned with the behaviour of 
deterministic and stochastic non-linear systems that have an implicit 
discrete- or continuous-time dimension. The field originates in the pio- 
neering work of the great mathematician Henri Poincaré, at the turn of the 
twentieth century, on the stability of the Solar System. Both applied and 
theoretical research has flourished over the past three decades across a 
variety of disciplines, including mathematics, statistics, physics, engineer- 
ing, econometrics, economics and finance. An extensive overview of the 
research on non-linear dynamics, with a bias towards the natural sciences, 
is given by Hilborn (1997). The meaning of the term ‘non-linear dynamics’ 
seems to vary considerably across scientific disciplines and time periods. 
For example, a popular interpretation since the early 1980s associates non- 
linear dynamics with deterministic non-linear systems and a particular 
dynamic behaviour called chaos. The term ‘chaos’ itself has also received 
several different interpretations, to the extent of becoming a scientifically 
dangerous concept (see Griffeath, 1992). 

This diversity of meanings with respect to chaos is mainly because no 
formal and complete mathematical definition of chaotic systems exists other 
than a somewhat loose symptomatology (see, for example, Berliner, 1992). 
Broadly speaking, chaos is the mathematical condition whereby a simple 
(low-dimensional) non-linear dynamical system produces highly complex 
(infinite-dimensional or random-like) behaviour. Even though these systems 
are deterministic (they have finite ‘phase space’ dimension), they are com- 
pletely unpredictable in the long run, due to ‘sensitive dependence on initial 
conditions’ (or Lyapunov instability). Chaotic systems also invariably exhibit 
power-law behaviour (continuous, broadband and power-law declining 
spectral density) and have ‘fractal’ or ‘self-similar’ pictorial representations 
(‘strange’ or non-integer phase-space dimension attractors). 

An example of a chaotic process is one that is generated by a deterministic 
difference equation 

$$x_t = f(x_{t-1}, \ldots, x_{t-p})$$

such that $x_t$ does not tend to a constant or a (limit) cycle and has estimated 
covariances that are very small or zero. A simple example is provided by 
Brock (1986), where a formal development of deterministic chaos models is 
provided. Consider the difference equation 

$$x_t = f(x_{t-1}), \qquad x_0 \in [0, 1]$$

where 

$$f(x) = \begin{cases} x/a & x \in [0, a] \\ (1 - x)/(1 - a) & x \in [a, 1] \end{cases} \qquad 0 < a < 1$$

Most realisations (or trajectories) of this difference equation generate the 
same SACFs as an AR(1) process for $x_t$ with parameter $\phi = (2a - 1)$. Hence, 
for $a = 0.5$, the realisation will be indistinguishable from white noise, 
although it has been generated by a purely deterministic non-linear process. 
For further discussion of this particular function, called a tent map because 
the graph of $x_t$ against $x_{t-1}$ (known as the phase diagram) is shaped like a 
‘tent’, see Hsieh (1991), who also considers other relevant examples of 
chaotic systems, such as the logistic map: 

$$x_t = 4x_{t-1}(1 - x_{t-1}) = 4x_{t-1} - 4x_{t-1}^2, \qquad 0 < x_0 < 1$$

This also has the same autocorrelation properties as white noise, although $x_t^2$ 
has an SACF consistent with an MA(1) process. 
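
The autocorrelation properties of the two maps can be checked numerically. The 
sketch below is illustrative only: the starting values, the sample size and the 
choice a = 0.7 for the tent map (so that 2a - 1 = 0.4) are assumptions. The value 
a = 0.5 is deliberately avoided, because in binary floating-point arithmetic that 
particular tent map collapses to zero after a few dozen iterations. 

```python
def tent_map(x0, a, n):
    """Iterate the tent map x_t = f(x_{t-1}) with kink at a."""
    xs, x = [], x0
    for _ in range(n):
        x = x / a if x <= a else (1.0 - x) / (1.0 - a)
        xs.append(x)
    return xs


def logistic_map(x0, n):
    """Iterate the logistic map x_t = 4 x_{t-1} (1 - x_{t-1})."""
    xs, x = [], x0
    for _ in range(n):
        x = 4.0 * x * (1.0 - x)
        xs.append(x)
    return xs


def sacf(xs, k):
    """Sample autocorrelation at lag k."""
    n = len(xs)
    mean = sum(xs) / n
    c0 = sum((x - mean) ** 2 for x in xs)
    ck = sum((xs[t] - mean) * (xs[t - k] - mean) for t in range(k, n))
    return ck / c0


tent = tent_map(0.2, 0.7, 20000)
logistic = logistic_map(0.3, 20000)
logistic_sq = [x * x for x in logistic]

tent_r1 = sacf(tent, 1)             # should be close to 2a - 1 = 0.4
logistic_r1 = sacf(logistic, 1)     # should be close to zero
logistic_sq_r1 = sacf(logistic_sq, 1)  # non-zero: dependence shows up in squares
logistic_sq_r2 = sacf(logistic_sq, 2)
```

The levels of the logistic map look like white noise at every lag, yet the 
squares are correlated at lag one, which is why second-moment diagnostics are 
needed to reveal the deterministic structure. 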

Are such models useful in finance? One must keep in mind that systematic 
research on chaos was first undertaken in the natural sciences and, in 
accordance with tradition, the enthusiasm and mathematics associated with 
the new field were quickly carried over to economics. Many non-linear 
systems in the natural sciences are low-dimensional, either by experimental 
construction or from first principles, so chaos is a natural choice for 
explaining complex empirical behaviour. This is because, in deterministic 
systems, the standard types of dynamic behaviour are limited to fixed-point 
equilibria and limit cycles, and hence complexity can arise only in the 
presence of chaos or high-dimensionality. High-dimensional or ‘stochastic’ 
chaos is of little interest, since it is typically considered for all practical 
purposes to be equivalent to randomness. Unfortunately, a deduction 
analogous to the above is not possible in finance, since it is generally accepted 
that financial markets and agents are inherently highly stochastic and 
evolving and hence there is no practical need to resort to chaos in order to 
explain complex behaviour. Although chaos may have a prominent place in 
the study of deterministic low-dimensional dynamic behaviour, it seems to 
have a limited and rather exotic role to play in the context of stochastic linear 
and non-linear dynamics. 

It is therefore not surprising that applications of chaos theory in finance 
and economics have been far less popular and successful than in the natural 
sciences. Nevertheless, the interest in chaos in finance is considerable and 
continues to persist. Much has been motivated by the ability of chaotic 
systems to produce complicated behaviour without resorting to exogenous 
stochastic factors and shocks. An underlying hope was that the apparently 
stochastic behaviour and long-run unpredictability of financial systems 
could be the product of a low-dimensional, and hence tractable, chaotic 
system. 

Broadly speaking, research on chaos in finance has followed two distinct 
directions. The first starts with a non-linear deterministic theoretical model 
and demonstrates that specific configurations can produce chaotic behaviour 
(see the selective review by Fernandez-Rodriguez, Sosvilla-Rivero and 
Andrada-Félix, 2005). For example, Brock (1988) considers some models of 
equilibrium asset pricing that might lead to chaos and complex dynamics. 
In these models, the idea that there should be no arbitrage profits in financial 
equilibrium is linked with the theory of economic growth to show how 
dynamics in the ‘dividend’ process are transmitted through the equilibrating 
mechanism to equilibrium asset prices. These dynamics can be linear, non- 
linear or chaotic depending on the constraints imposed on the models. 
Although several models of this type were found to be capable of producing 
such ‘mathematical’ chaos, especially in economics, empirical validation 
was never undertaken. Furthermore, the underlying strong assumptions 
regarding deterministic dynamic behaviour are highly questionable (see 
Granger, 1992). 

The second approach is model-free and uses non-parametric procedures 
to test observed economic time series for signs of chaotic behaviour (e.g. see 
the recent studies by Fernandez-Rodriguez, Sosvilla-Rivero and Andrada- 
Félix, 2005, Kyrtsou and Serletis, 2006, and Shintani and Linton, 2006, and 
the references contained therein). Although some studies claim to have 
found ‘empirical’ chaos, such evidence cannot be considered as conclu- 
sive since the testing procedures used are susceptible to problems with 
respect to autocorrelation, small sample size, noise, heteroskedasticity, non- 
stationarities, aggregation, microstructures and seasonalities. More import- 
antly, all the evidence presented is merely circumstantial, since no formal 
testing procedure has been developed for stochastic time series where chaos 
enters as the null hypothesis. Even if chaos were present in the data, estimating 
the unknown parameters of the underlying model would be practically 
impossible (Geweke, 1993). 

Finally, the literature has not provided convincing arguments about the 
practical implications of chaos in finance and the marginal benefits of 
assuming chaotic behaviour. In addition, there has been little empirical 
evidence of chaotic dynamics uncovered in financial time series, although 
much evidence of other types of stochastic non-linearities. This has been 
obtained from a variety of tests for non-linearity, to which we now turn. 


6.5 Testing for non-linearity 


As the previous sections have demonstrated, there has been a wide variety of 
non-linear models proposed for modelling financial time series. We have, in 
particular, compared and contrasted the ARCH and bilinear models, and in 
so doing have discussed LM tests for each. Nevertheless, given the range of 
alternative non-linear models, it is not surprising that a number of other tests 
for non-linearity have also been proposed. Since the form of the departure 
from linearity is often difficult to specify a priori, many tests are ‘diagnostic’ 
in nature, i.e. a clear alternative to the null hypothesis of linearity is not 
specified, and this, of course, leads to difficulties in discriminating between 
the possible causes of ‘non-linear misspecification’. 

The detection of non-linearity is further complicated by the fact that it has 
similar symptoms to other types of time series behaviour. For example, 
Andersson, Eklund and Lyhagen (1999) have shown that long-memory may 
lead to spurious rejection of the linearity hypothesis. As demonstrated by 
Granger and Teräsvirta (1999) and Diebold and Inoue (2001), the opposite 
may also be true, since some non-linear processes exhibit characteristics that 
justify modelling via a long-memory model. An interesting direction con- 
siders testing and modelling non-linearity within a long-memory process 
(see, for example, Baillie and Kapetanios, 2007). Koop and Potter (2001) 
have shown that unpredictable structural instability in a time series may also 
produce erroneous evidence of threshold-type non-linearity. An alarming 
finding by Ghysels, Granger and Siklos (1996) is that non-linear trans- 
formations, such as the X11 seasonal adjustment procedure, that are rou- 
tinely applied prior to time series modelling may also induce non-linear 
behaviour. Equally, seasonal adjustments may smooth out structural shifts 
and switching between regimes (see Franses and Paap, 1999). Finally, as 
discussed by van Dijk, Franses and Lucas (1999) and de Lima (1997), neg- 
lecting outliers and non-normalities may also lead to spurious evidence of 
non-linearity. Despite the difficulties, testing for non-linearity is usually an 
effort well spent, since the burden associated with the specification and 
estimation of non-linear models is often substantial and complex. 

Empirical applications and simulation studies (e.g. Lee, White and 
Granger, 1993, and Barnett et al., 1996, 1997) have shown that no non- 
linearity test dominates in all situations and that power varies with sample 
size and the characteristics of the underlying stochastic process. This means 
that, in practice, it is advisable to apply a variety of non-linearity tests to the 
data to guide the model specification process. Some of the most popular tests 
that have appeared in the financial literature are described in the remainder 
of this section. 

On the basis of Volterra expansions, Ramsey (1969), Keenan (1985) and 
Tsay (1986) provide regression type tests of linearity against unspecified 
alternatives. These appear to have good power against the non-linear moving 
average (see Robinson, 1977) and bilinear alternatives, but possibly have low 
power against ARCH models. In developing these tests, we assume that an 
AR(p) process has been fitted to the observed series $x_t$ and that the residuals, 
$e_t$, and the fitted values, $\hat{x}_t = x_t - e_t$, have been calculated. Ramsey’s 
original regression error specification test (RESET) is constructed from the 
auxiliary regression 

$$e_t = \sum_{i=1}^{p} \varphi_i x_{t-i} + \sum_{j=2}^{h} \delta_j \hat{x}_t^{\,j} + v_t$$

and is the F-test of the hypothesis $H_0$: $\delta_j = 0$, $j = 2, \ldots, h$. If $h = 2$, this is 
equivalent to Keenan’s test, while Tsay augments the auxiliary regression 
with second-order terms: 

$$e_t = \sum_{i=1}^{p} \varphi_i x_{t-i} + \sum_{i=1}^{p} \sum_{j=i}^{p} \delta_{ij} x_{t-i} x_{t-j} + v_t$$

in which the linearity hypothesis is $H_0$: $\delta_{ij} = 0$, for all i and j. These tests have 
LM interpretations and Tsay’s test has power against a greater variety of 
non-linear models than the RESET. A further extension is provided by Teräsvirta, 
Lin and Granger (1993), in which the auxiliary regression becomes 

$$e_t = \sum_{i=1}^{p} \varphi_i x_{t-i} + \sum_{i=1}^{p} \sum_{j=i}^{p} \delta_{ij} x_{t-i} x_{t-j} + \sum_{i=1}^{p} \sum_{j=i}^{p} \sum_{k=j}^{p} \delta_{ijk} x_{t-i} x_{t-j} x_{t-k} + v_t$$

with the linearity hypothesis now being $H_0$: $\delta_{ij} = 0$, $\delta_{ijk} = 0$ for all i, j and k. 
This is related to the ‘neural network’ test discussed by Lee, White and 
Granger (1993) and appears to have better power. 
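
As an illustration of how such regression-type tests are computed, the sketch 
below implements the RESET auxiliary regression for a demeaned series using only 
ordinary least squares. It is a minimal version under stated assumptions: the 
helper names (`solve`, `ols`, `reset_test`, `simulate`) and the two simulated 
series are hypothetical, no intercept is included because the series is demeaned 
first, and only the F statistic is returned, to be compared with an 
F(h - 1, n - p - h + 1) critical value. 

```python
import random


def solve(a, b):
    """Solve the linear system a z = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        z[r] = (m[r][n] - sum(m[r][k] * z[k] for k in range(r + 1, n))) / m[r][r]
    return z


def ols(X, y):
    """Return OLS coefficients and residuals via the normal equations."""
    k = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    beta = solve(xtx, xty)
    resid = [yi - sum(b * v for b, v in zip(beta, r)) for r, yi in zip(X, y)]
    return beta, resid


def reset_test(x, p=1, h=2):
    """F statistic for H0: delta_j = 0, j = 2..h, in the auxiliary regression."""
    mean = sum(x) / len(x)
    x = [v - mean for v in x]                    # work with a zero-mean series
    lags = [[x[t - i] for i in range(1, p + 1)] for t in range(p, len(x))]
    y = x[p:]
    _, e = ols(lags, y)                          # residuals e_t from the AR(p) fit
    fitted = [yt - et for yt, et in zip(y, e)]   # fitted values x_t - e_t
    aux = [row + [f ** j for j in range(2, h + 1)] for row, f in zip(lags, fitted)]
    _, v = ols(aux, e)
    rss0 = sum(et ** 2 for et in e)              # restricted: e_t on the lags only
    rss1 = sum(vt ** 2 for vt in v)              # unrestricted auxiliary regression
    q, dof = h - 1, len(e) - p - (h - 1)
    return ((rss0 - rss1) / q) / (rss1 / dof)


def simulate(n, nonlinear, seed):
    """AR(1) series, optionally with a non-linear moving average term."""
    rng = random.Random(seed)
    x, eps_prev, out = 0.0, 0.0, []
    for _ in range(n):
        eps = rng.gauss(0.0, 1.0)
        extra = 0.3 * (eps_prev ** 2 - 1.0) if nonlinear else 0.0
        x = 0.4 * x + eps + extra
        out.append(x)
        eps_prev = eps
    return out


f_linear = reset_test(simulate(1000, False, seed=1))
f_nonlinear = reset_test(simulate(1000, True, seed=1))
```

Tsay’s test and the Teräsvirta, Lin and Granger test follow the same pattern, 
with the powers of the fitted values replaced by the second- and third-order 
cross-products of the lagged series. 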

A portmanteau test for non-linearity developed by McLeod and Li (1983) is 
based on the Ljung-Box statistic calculated using the squared residuals 
obtained from a linear fit. The test exploits an idea by Granger and Andersen 
(1978) that, if the residuals from an AR(p) fit are iid, then the cross-product 
of their squares should have a correlation structure that is the same as that 
of the square of their cross-products (see section 6.1). Under the null 
hypothesis of linearity, the first m autocorrelations among the squared 
residuals are zero and the Ljung-Box test statistic is distributed as $\chi^2_m$. 
This test has good power against ARCH behaviour and is asymptotically 
equivalent to the LM test statistic developed by Engle (1982): see Granger 
and Teräsvirta (1993). As expected, the power of the test is sensitive to 
departures from normality. 
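
A minimal sketch of the McLeod and Li statistic is given below. The residual 
series are simulated rather than taken from any model in this chapter, and the 
ARCH(1) parameters, sample size and function names are assumptions for 
illustration. The statistic is the usual Ljung-Box form 
$Q(m) = T(T+2)\sum_{k=1}^{m} r_k^2/(T-k)$ applied to squared residuals. 

```python
import math
import random


def sacf(xs, k):
    """Sample autocorrelation at lag k."""
    n = len(xs)
    mean = sum(xs) / n
    c0 = sum((x - mean) ** 2 for x in xs)
    ck = sum((xs[t] - mean) * (xs[t - k] - mean) for t in range(k, n))
    return ck / c0


def mcleod_li(resid, m):
    """Ljung-Box statistic computed from the squared residuals."""
    sq = [e * e for e in resid]
    n = len(sq)
    return n * (n + 2) * sum(sacf(sq, k) ** 2 / (n - k) for k in range(1, m + 1))


def simulate_arch1(n, alpha0, alpha1, seed):
    """ARCH(1) series e_t = u_t * sqrt(alpha0 + alpha1 * e_{t-1}^2)."""
    rng = random.Random(seed)
    e_prev, out = 0.0, []
    for _ in range(n):
        e_prev = rng.gauss(0.0, 1.0) * math.sqrt(alpha0 + alpha1 * e_prev ** 2)
        out.append(e_prev)
    return out


rng = random.Random(7)
iid_resid = [rng.gauss(0.0, 1.0) for _ in range(2000)]
arch_resid = simulate_arch1(2000, 0.2, 0.4, seed=7)

q_iid = mcleod_li(iid_resid, 10)
q_arch = mcleod_li(arch_resid, 10)

CHI2_10_5PCT = 18.307  # 5 per cent critical value of chi-squared with 10 d.f.
```

With m = 10 the statistic is compared with the $\chi^2_{10}$ critical value; the 
ARCH series should produce a value far above it, whereas the iid series will 
typically not. 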
When residuals from an ARMA-GARCH model are used the test no 
longer follows a $\chi^2$ distribution and must be corrected along the lines 
suggested by Li and Mak (1994). Peña and Rodríguez (2005) have recently 
proposed a simple extension of this test that employs information criteria in 
the selection of the optimal lag structure for the autoregressive models fitted 
to the squared residuals. The checking procedure posits that, if the optimal 
lag structure is non-zero, then it can be inferred that there are non-linearities 
present in the data. Simulation evidence shows that, when the Bayesian 
information criterion (BIC) is used, this test performs favourably for a wide 
variety of non-linear processes and sample sizes. It was found, however, to 
have poor power against threshold non-linear processes and certain types of 
heteroskedastic behaviour. 

Once evidence in favour of non-linearity has been found, Hsieh (1989b) 
has developed a test that can shed light on the type of non-linearity present. 
More specifically, the test attempts to discriminate between two types of 
non-linearity: ‘additive’ and ‘multiplicative’. In the former, non-linearity 
enters solely through the conditional mean of the process 


$$e_t = g(x_{t-1}, \ldots, x_{t-k}, e_{t-1}, \ldots, e_{t-k}) + u_t$$


where g(·) is an arbitrary non-linear function. This suggests that a model 
from the bilinear or SDM family may be appropriate. Multiplicative 
non-linearity manifests itself through the conditional variance, thus pointing 
towards the direction of an ARCH-type model: 


$$e_t = g(x_{t-1}, \ldots, x_{t-k}, e_{t-1}, \ldots, e_{t-k})\,u_t$$


The test exploits the fact that, unlike additive dependence, multiplicative 
dependence implies that 


$$E(e_t \mid x_{t-1}, \ldots, x_{t-k}, e_{t-1}, \ldots, e_{t-k}) = 0 \qquad (6.15)$$


Assuming that g(·) is at least twice continuously differentiable, it can 
be approximated via a Taylor expansion around zero. The test is based on the 
fact that the residuals, $u_t$, must be uncorrelated with the terms in this 
expansion under multiplicative dependence. The test is implemented by 
estimating the scaled third moment of the data: 

$$r_{eee}(i, j) = \frac{T^{-1} \sum e_t e_{t-i} e_{t-j}}{\left(T^{-1} \sum e_t^2\right)^{3/2}}$$

Under the null hypothesis of multiplicative non-linearity, $T^{1/2} r_{eee}(i, j)$ is 
asymptotically normally distributed with a variance that can be consistently 
estimated by 

$$\frac{T^{-1} \sum e_t^2 e_{t-i}^2 e_{t-j}^2}{\left(T^{-1} \sum e_t^2\right)^{3}}$$

As discussed by Hsieh (1989b), the approach is similar to that of Tsay (1986), 
who tests jointly $r_{eee}(i, j)$ for $0 < i, j < k$. The difference is that Tsay’s test 
assumes that $e_t$ is iid while Hsieh’s test assumes only that the expectation in 
(6.15) is zero under sufficient moment conditions. The former test thus 
captures any departures from linearity while the latter rejects the null only in 
the presence of additive, but not multiplicative, non-linearity. 
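
The scaled third moment and its standardised version are straightforward to 
compute. The sketch below is illustrative: the function name `hsieh_z`, the 
non-linear moving average used as an ‘additive’ example and the ARCH(1) process 
used as a ‘multiplicative’ example are all assumptions, not series analysed in 
the text. 

```python
import math
import random


def hsieh_z(e, i, j):
    """Standardised third-moment statistic sqrt(T) * r_eee(i, j) / sqrt(variance)."""
    T = len(e)
    start = max(i, j)
    m2 = sum(v * v for v in e) / T
    prods = [e[t] * e[t - i] * e[t - j] for t in range(start, T)]
    r_eee = (sum(prods) / T) / m2 ** 1.5
    variance = (sum(p * p for p in prods) / T) / m2 ** 3
    return math.sqrt(T) * r_eee / math.sqrt(variance)


def additive_series(n, seed):
    """Non-linear MA: e_t = u_t + 0.6 u_{t-1} u_{t-2} (non-linear in the mean)."""
    rng = random.Random(seed)
    u = [rng.gauss(0.0, 1.0) for _ in range(n + 2)]
    return [u[t] + 0.6 * u[t - 1] * u[t - 2] for t in range(2, n + 2)]


def multiplicative_series(n, seed):
    """ARCH(1): e_t = u_t sqrt(0.2 + 0.4 e_{t-1}^2) (non-linear in the variance)."""
    rng = random.Random(seed)
    e_prev, out = 0.0, []
    for _ in range(n):
        e_prev = rng.gauss(0.0, 1.0) * math.sqrt(0.2 + 0.4 * e_prev ** 2)
        out.append(e_prev)
    return out


z_additive = hsieh_z(additive_series(2000, seed=3), 1, 2)
z_multiplicative = hsieh_z(multiplicative_series(2000, seed=3), 1, 2)
```

For the additive example the statistic at (i, j) = (1, 2) should lie well 
outside the usual normal critical values, while for the ARCH example it should 
behave approximately like a standard normal draw. 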

Non-linearity tests have also been developed on the basis of the revers- 
ibility of a stochastic process. A stationary process is said to be time-reversible 
(TR) if all its finite dimensional distributions are invariant to the reversal of 
time indices. In other words, if the probabilistic structure of a time series is 
identical whether going forwards or backwards in time, the series is time- 
reversible; otherwise it is said to be irreversible. Sequences that are iid and 
stationary Gaussian, such as ARMA processes, will be time-reversible. A 
linear, non-Gaussian process will, in general, be time-irreversible, however. 
Ramsey and Rothman (1996) have proposed the TR test statistic, estimated 
for various lags k as 

$$TR(k) = \hat{B}_{2,1}(k) - \hat{B}_{1,2}(k)$$

where $\hat{B}_{2,1}(k)$ and $\hat{B}_{1,2}(k)$ are the method of moments estimators of the 
bicovariances $E(x_t^2 x_{t-k})$ and $E(x_t x_{t-k}^2)$, respectively. These can be estimated 
using the residuals from a linear fit as 

$$\hat{B}_{i,j}(k) = (T - k)^{-1} \sum_{t=k+1}^{T} e_t^{\,i}\, e_{t-k}^{\,j}, \qquad i, j = 1, 2$$

Although ARCH processes are irreversible, the TR test has no power 
against them since their bicovariances are zero. Under the null hypothesis of 
time reversibility, TR has an expected value of zero for all lags. When the 
process is iid, TR is asymptotically normally distributed with variance 

$$\frac{2\left(\mu_4 \mu_2 - \mu_3^2\right)}{T - k} - \frac{2\mu_2^3 (T - 2k)}{(T - k)^2}, \qquad \mu_i = E(e_t^{\,i})$$
As shown by Rothman (1992), the convergence to asymptotic normality is 
adequately fast even when the process is non-iid and the test is applied to 
residuals from a linear fit with non-normal errors. Rothman shows that the 
test has reasonable power against simple bilinear and threshold autoregressive 
(TAR) models and that the distinct rejection pattern of the test can be utilised 
in the model identification process. A time reversibility test with milder 
moment restrictions than the TR has been proposed by Chen, Chou and 
Kuan (2000). 
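
The TR statistic and its variance under the iid null can be computed directly 
from the sample bicovariances. In the sketch below the function names and the 
short numerical series are hypothetical, and the variance expression is the 
Ramsey and Rothman form with sample moments substituted for the population 
moments. The final lines check a simple property: reversing the time order of a 
series swaps the two bicovariances, so the statistic changes sign. 

```python
import random


def bicovariance(e, i, j, k):
    """Method of moments estimate of E(e_t^i e_{t-k}^j)."""
    T = len(e)
    return sum(e[t] ** i * e[t - k] ** j for t in range(k, T)) / (T - k)


def tr_statistic(e, k):
    """TR(k): difference between the (2,1) and (1,2) bicovariances at lag k."""
    return bicovariance(e, 2, 1, k) - bicovariance(e, 1, 2, k)


def tr_variance_iid(e, k):
    """Variance of TR(k) when the series is iid, using sample moments."""
    T = len(e)
    mu = {r: sum(v ** r for v in e) / T for r in (2, 3, 4)}
    return (2.0 * (mu[4] * mu[2] - mu[3] ** 2) / (T - k)
            - 2.0 * mu[2] ** 3 * (T - 2 * k) / (T - k) ** 2)


# Hypothetical short series and its time reversal.
series = [0.5, -1.2, 0.3, 2.0, -0.7, 0.1, 1.4, -0.9]
tr_forward = tr_statistic(series, 1)
tr_backward = tr_statistic(series[::-1], 1)

# Standardised statistic for a simulated Gaussian iid sample.
rng = random.Random(11)
gaussian = [rng.gauss(0.0, 1.0) for _ in range(500)]
tr_var = tr_variance_iid(gaussian, 1)
tr_z = tr_statistic(gaussian, 1) / tr_var ** 0.5
```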

Non-parametric tests of serial independence have also attracted interest as 
a means of searching for non-linearity (see Dufour, 1982). These include a 
wide variety of procedures, including sign, permutation and rank tests for 
independence. Non-parametric approaches have also been developed to test 
against serial dependence of fixed order (see Pinske, 1998). Most of these 
non-parametric tests are based on the actual series, rather than on stand- 
ardised residuals from some linear fit, and therefore the applicability of their 
limit distributions for, say, AR residuals is mostly unknown. 

A non-parametric test that has created considerable interest is the BDS 
statistic, named after Brock, Dechert and Scheinkman, based on the concept 
of the correlation integral: see, for example, Brock (1986), Brock, Hsieh and 
LeBaron (1991), Brock and Dechert (1991) and Dechert (1996). The test is 
based on the idea that the evolution of the next values of any two blocks of 
observations that are close in some metric should also be close in the same 
metric. For an observed series $\{x_t\}_1^T$, the correlation integral $C_N(\ell, T)$ is 
defined as 

$$C_N(\ell, T) = \frac{2}{T_N (T_N - 1)} \sum_{t<s} I_\ell\left(x_t^N, x_s^N\right)$$

where 

$$x_t^N = (x_t, x_{t+1}, \ldots, x_{t+N-1})$$

and 

$$x_s^N = (x_s, x_{s+1}, \ldots, x_{s+N-1})$$

are called ‘N-histories’, $I_\ell(x_t^N, x_s^N)$ is an indicator function that equals one 
if $\|x_t^N - x_s^N\| < \ell$ and zero otherwise, $\|\cdot\|$ being the sup-norm, and 
$T_N = T - N + 1$. 

The correlation integral is an estimate of the probability that any two 
N-histories, $x_t^N$ and $x_s^N$, are within $\ell$ of each other. If the $x_t$s are strict white 
noise, then 

$$C_N(\ell, T) \to C_1(\ell, T)^N$$

as $T \to \infty$, and 

$$w_N(\ell, T) = \sqrt{T}\left(C_N(\ell, T) - C_1(\ell, T)^N\right) / \sigma_N(\ell, T)$$

has a standard normal limiting distribution, where the expression for the 
variance $\sigma_N^2(\ell, T)$ may be found in, for example, Hsieh (1989b, p. 343). 
Thus, the BDS statistic $w_N(\ell, T)$ tests the null hypothesis that a series is strict 
white noise; it is a diagnostic test, since a rejection of this null is consistent 
with some type of dependence in the data, which could result from a linear 
stochastic system, a non-linear stochastic system or a non-linear determin- 
istic system. Additional diagnostic tests are therefore needed to determine 
the source of the rejection, but simulation experiments do suggest that the 
BDS test has power against simple linear deterministic systems as well as 
non-linear stochastic processes. 
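
The correlation integral itself is simple to compute by direct enumeration of 
pairs of N-histories. The sketch below is illustrative only: the function name 
and the simulated uniform sample are assumptions, and the variance 
$\sigma_N^2(\ell, T)$ needed to form the full BDS statistic is not computed here. 

```python
import random


def correlation_integral(x, N, ell):
    """Fraction of pairs of N-histories within ell of each other (sup-norm)."""
    TN = len(x) - N + 1
    histories = [x[t:t + N] for t in range(TN)]
    close = 0
    for t in range(TN):
        for s in range(t + 1, TN):
            distance = max(abs(a - b) for a, b in zip(histories[t], histories[s]))
            if distance < ell:
                close += 1
    return 2.0 * close / (TN * (TN - 1))


# Hypothetical iid sample: under strict white noise C_N should be near C_1^N.
rng = random.Random(0)
u = [rng.random() for _ in range(300)]
c1 = correlation_integral(u, 1, 0.25)
c2 = correlation_integral(u, 2, 0.25)
gap = c2 - c1 ** 2  # numerator of the BDS statistic before scaling by sqrt(T)

# A constant series: every pair of histories is 'close', so the integral is one.
c_const = correlation_integral([1.0] * 10, 2, 0.5)
```

The double loop is of order $T^2$, which is adequate for a few hundred 
observations but slow for long daily series, where a vectorised or library 
implementation would normally be preferred. 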

A number of non-linearity testing procedures have been developed in 
the frequency domain. These are based on the bispectrum $f(\omega_i, \omega_j)$, i.e. 
the third-order moments, in the frequency domain, of a third-order 
stationary series. When appropriately normalised, the bispectrum is constant 
for a linear series, irrespective of frequency, and assumes the value of zero 
for a Gaussian process. More specifically, the normalised bispectrum is 
given by 

$$b(\omega_i, \omega_j) = \frac{\left|f(\omega_i, \omega_j)\right|^2}{f(\omega_i)\, f(\omega_j)\, f(\omega_i + \omega_j)}$$

where $f(\omega)$ is the spectral density of the series. Subba Rao and Gabr 
(1980) and Hinich (1982) have developed tests of linearity and normality 
using the bispectrum that have the advantage of not requiring pre-filtering, 
but they have disadvantages in terms of implementation and sample 
requirements. 

A related procedure has been developed by Hinich and Patterson (1985) 
that is based on the sample bicovariance of a series, and it can be thought of 
as a generalisation of the Box—Pierce portmanteau statistic. The test statistic 
is calculated as 


$$x_3 = \sum_{s=2}^{\ell} \sum_{r=1}^{s-1} \left[G(r, s)\right]^2$$

where $G(r, s) = C_3(r, s)\sqrt{T - s}$. The sample bicovariances 

$$C_3(r, s) = (T - s)^{-1} \sum_{t=1}^{T-s} x_t x_{t+r} x_{t+s}$$

can be thought of as a generalisation of skewness and are all equal to zero for 
zero-mean iid data. Under the null hypothesis that the process is iid, Hinich 
and Patterson (1985) prove that $x_3$ is asymptotically distributed as $\chi^2$ with 
$\ell(\ell - 1)/2$ degrees of freedom for $\ell < T^{0.5}$ and recommend using $\ell = T^{0.4}$. 
Hong and Lee (2003) use a generalised spectral approach to develop a 
diagnostic testing procedure for non-linear and linear models. As with the 
BDS test, this has an appealing nuisance-parameter-free property, and it is 
shown to be asymptotically more efficient. 
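
A sketch of the Hinich and Patterson bicovariance statistic is given below. It 
assumes that the series is first standardised to zero mean and unit variance 
(an assumption made here so that the $\chi^2$ scaling applies), truncates 
$T^{0.4}$ to an integer, and uses a hypothetical function name and a simulated 
Gaussian sample. 

```python
import math
import random


def bicovariance_test(x, ell=None):
    """Return the portmanteau bicovariance statistic and its degrees of freedom."""
    T = len(x)
    mean = sum(x) / T
    sd = math.sqrt(sum((v - mean) ** 2 for v in x) / T)
    z = [(v - mean) / sd for v in x]            # standardised series
    if ell is None:
        ell = int(T ** 0.4)                     # recommended lag truncation
    stat = 0.0
    for s in range(2, ell + 1):
        for r in range(1, s):
            c3 = sum(z[t] * z[t + r] * z[t + s] for t in range(T - s)) / (T - s)
            stat += (c3 * math.sqrt(T - s)) ** 2  # G(r, s) squared
    dof = ell * (ell - 1) // 2
    return stat, dof


rng = random.Random(5)
sample = [rng.gauss(0.0, 1.0) for _ in range(1000)]
hp_stat, hp_dof = bicovariance_test(sample)
```

For T = 1000 the truncation gives $\ell$ = 15 and hence 105 degrees of freedom, so 
for iid data the statistic should be of roughly that magnitude. 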

Tests are also available for specific non-linear alternatives. Tests against 
ARCH and bilinear alternatives have already been discussed in sections 6.1 
and 6.2 and there is also a fully developed testing procedure against STAR 
models. From Teräsvirta (1994), an LM-type test statistic for the null of 
linearity against an LSTAR alternative can be constructed from the auxiliary 
regression 

$$e_t = \sum_{i=1}^{p} \varphi_i x_{t-i} + \sum_{j=1}^{p} \delta_{1j} x_{t-j} x_{t-d} + \sum_{j=1}^{p} \delta_{2j} x_{t-j} x_{t-d}^2 + \sum_{j=1}^{p} \delta_{3j} x_{t-j} x_{t-d}^3 + v_t$$

with the linearity hypothesis being $H_0$: $\delta_{ij} = 0$, for all i and j. To test against an 
ESTAR alternative the same auxiliary regression is estimated, but without the 
fourth-order terms, i.e. we set $\delta_{3j} = 0$ a priori. This relationship between the 
two tests leads naturally to a method for discriminating between the two 
types of STAR models (see Teräsvirta, 1994, for details, and example 6.6 
below). Of course, these tests assume that the delay parameter d is known. 
Typically its value will be unknown, and Teräsvirta suggests that it should be 
chosen on the basis of a sequence of LM tests for alternative values of d; we 
choose the value that minimises the p-value of the individual tests in 
the sequence. The auxiliary regression can also be estimated with $x_t$ rather 
than $e_t$ as the dependent variable, and this may be preferred as it provides a 
direct comparison with the AR(p) model under the null of linearity. Van 
Dijk, Teräsvirta and Franses (2002) discuss some extensions to this testing 
procedure. 

Further tests are discussed, within a general econometric context, in 
Granger and Teräsvirta (1993, chap. 6) and in the survey by Teräsvirta, 
Tjøstheim and Granger (1994). It should be emphasised, however, that all 
these tests are designed to distinguish between linear and non-linear 
stochastic dynamics. They are not, as yet, capable of distinguishing non-linear 
stochastic dynamics from deterministic chaotic dynamics, although the 
rejection of linearity may, of course, motivate the investigation of chaotic 
models, as discussed in section 6.4. A test that is claimed to be able to detect 
chaos in noisy data is the Lyapunov exponent estimator of Nychka et al. 
(1992), which has been subject to rigorous scrutiny in Barnett et al. (1996, 
1997). 


Example 6.6 Non-linearity tests and an ANN model for UK gilt yields 

The residuals from the linear AR(2) model fitted to $\Delta R20$ in example 6.4 
were used to construct various tests of non-linearity. The LM test for 
twelfth-order ARCH produced the statistic $\chi^2_{12} = 114.5$, which is obviously 
significant, and LM tests for bilinearity with R = S = 1 and R = S = 2, respectively, 
obtained $\chi^2_1 = 6.19$ and $\chi^2_4 = 12.75$, both significant at the 5 per cent level. 
The RESET test with h = 2 (i.e. Keenan’s test) has a p-value of 0.017, Tsay’s 
test a p-value of 0.10, and Teräsvirta, Lin and Granger’s test a p-value of 
0.003. Thus, all bar Tsay’s test indicate substantial non-linear dependence in 
the data, the non-rejection of linearity by this test occurring because the 
additional regressors over the RESET test, $x_{t-1}x_{t-2}$ and $x_{t-2}^2$, are individually 
insignificant. 

Following, for example, Hsieh (1989b), the BDS tests were computed for a 
selection of $\ell$ and N values and are shown in table 6.2. All the statistics are 
highly significant, thus again indicating substantial non-linear dependence in 
the residuals. 

Table 6.2 BDS statistics for twenty-year gilts 

     ℓ = 0.5           ℓ = 1            ℓ = 1.5 
   N      w_N        N      w_N        N      w_N 
   2      7.62       2      7.87       2      7.60 
   3     11.63       3     11.09       3      9.97 
   4     15.21       4     13.86       4     12.11 
   5     20.20       5     16.49       5     13.67 
   6     26.51       6     19.00       6     14.94 

Note: ℓ is set in terms of the standard deviation of the residuals from the AR(2) fit, i.e. ℓ = 1 
is one standard deviation. 

Why was an LSTAR(2) model with delay parameter set at d = 1 fitted to 
$\Delta R20$ in example 6.3? Auxiliary regressions for d = 1 and d = 2 suggested 
that the former setting was appropriate; after deletion of insignificant 
regressors the auxiliary regression was 


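$$x_t = \underset{(0.053)}{0.233}\,x_{t-1} - \underset{(0.045)}{0.075}\,x_{t-2} + \underset{(0.063)}{0.197}\,x_{t-1}^2 - \underset{(0.131)}{0.372}\,x_{t-1}^2 x_{t-2} - \underset{(0.082)}{0.218}\,x_{t-1}^3 x_{t-2}$$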


To choose between an LSTAR and an ESTAR model, Teräsvirta (1994) 
suggests the following procedure: (i) test whether all ‘fourth-order’ terms are 
insignificant; (ii) conditional on all fourth-order terms being zero, test the 
joint significance of all third-order terms; and (iii) conditional on all third- 
and fourth-order terms being zero, test the significance of the second-order 
terms. If the test in (ii) produces the smallest p-value, select an ESTAR 
model; if not, choose an LSTAR model. The p-values are found to be (i) 
0.126, (ii) 0.159 and (iii) 0.042; thus, we chose to fit an LSTAR model. 
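
The selection rule just described reduces to a comparison of three p-values. A minimal sketch, assuming the p-values of tests (i) to (iii) have already been computed elsewhere; the function name `choose_star_type` is hypothetical, and treating a tie as favouring the LSTAR model is an assumption made here for definiteness:

```python
def choose_star_type(p_fourth, p_third, p_second):
    """Pick the STAR variant from the p-values of tests (i), (ii) and (iii).

    ESTAR is selected only when the third-order test (ii) has the strictly
    smallest p-value; otherwise LSTAR is selected (ties go to LSTAR, an
    assumption of this sketch).
    """
    if p_third < min(p_fourth, p_second):
        return "ESTAR"
    return "LSTAR"


# p-values reported in the text for tests (i), (ii) and (iii)
print(choose_star_type(0.126, 0.159, 0.042))  # prints LSTAR
```

With the reported values, test (iii) has the smallest p-value, so the rule returns an LSTAR model, in line with the choice made in the example.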

Methods of testing the adequacy of fitted STAR models are discussed in 
Eitrheim and Teräsvirta (1996). To check whether such a model is adequate, 
we can use the approach discussed above for linear models: e.g., to test 
against general ‘neglected’ non-linearity, second- and third-order terms of 
the form $x_{t-i}x_{t-j}$ and $x_{t-i}x_{t-j}x_{t-k}$ may be added to the LSTAR model and 
tested for significance. Doing so for the fitted LSTAR(2) model leads to a 
statistic that is significant at less than the 0.01 level. Eitrheim and Teräsvirta 
remark, however, that this does not give us much of a clue as to what model 
we should fit next; given the nature of the residuals from the LSTAR(2) 
model, we decided to fit GARCH(1, 1) errors, leading to the model discussed 
in example 6.4. 

Given the evidence of non-linearity, we also investigated the performance 
of ANNs. The logarithmic changes, $\Delta r20$, were used in this exercise, as this 
series is used in a sequence of examples in subsequent chapters. An AR(2), an 
LSTAR(2) with $d = 1$, and an ANN (an MLP with two inputs, $\Delta r20_{t-1}$ and 
$\Delta r20_{t-2}$, and five hidden neurons organised in one layer, denoted ANN(2:5)) 
were estimated over the sample January 1952 to December 2000. The MLP 
was estimated using 1500 training cycles and cross-validation. Table 6.3 
presents the within-sample root mean squared errors (RMSEs) for the three 
models, where we see that both non-linear models have smaller RMSEs 
than the AR(2) process, the ANN quite considerably so. The three models 
were also used to forecast the remaining five years of data, these 
outside-sample RMSEs also being shown in the table. Now we see that goodness of 
fit is reversed: the linear AR(2) model has the smallest RMSE and the ANN 
the largest, leading to the suspicion that perhaps the ANN had been 
over-trained. 


Table 6.3 Within-sample and forecasting performance of three models for $\Delta r20$ 

              RMSE in-sample:    RMSE outside-sample: 
              1952-2000          2001-2005 
AR(2)         0.0311             0.0289 
LSTAR(2)      0.0309             0.0294 
ANN(2:5)      0.0294             0.0298 


Example 6.7 The non-linear structure of the VIX 

In this example we apply various non-linearity tests to the VIX series that was 
modelled in example 6.5. As we saw, the best approximation to this process 
was found to be given by a Markov switching AR(1) model. The AIC and BIC 
for various AR models suggest an optimal lag of twelve and five, respectively. 
Since we are interested more in removing linear dependencies than in 
out-of-sample forecasting, we adopt the less parsimonious AR(12) model. The 
SACF and PACF statistics of the AR(12) residuals confirm that all 
autocorrelations have been removed from the series, but the squared residuals 
show strong evidence of serial dependence, with a highly significant $Q^2(15)$ 
statistic of 175.0. The Peña and Rodríguez (2005) testing procedure also 
finds non-linearity, since the optimal lag suggested by the BIC for 
autoregressive models for the squared residuals was clearly non-zero. Evidence of 
ARCH was provided by a highly significant LM test. Although Ramsey’s 
RESET test was only marginally significant using lags $h$ from one to five, 
Tsay’s test was highly significant when implemented with additional 
second-order terms up to five lags. 


Table 6.4 BDS statistics for the VIX residuals 


      $\ell = 0.5$          $\ell = 1$           $\ell = 1.5$
   $N$     $W_N$       $N$     $W_N$       $N$     $W_N$
    2     17.72         2     17.54         2     16.31
    3     23.97         3     22.69         3     20.43
    4     30.09         4     23.31         4     23.67
    5     37.19         5     31.68         5     26.30
    6     46.17         6     36.35         6     28.53

Note: $\ell$ is set in terms of the standard deviation of the residuals from the 
AR(12)-GARCH(1, 1) fit, i.e. $\ell = 1$ is one standard deviation. 


Since evidence in favour of heteroskedasticity is present in the data, we 
augment the AR(12) filtering procedure by using a GARCH(1, 1) process for 
modelling the conditional variance of the VIX. Reapplication of the above 
testing procedures suggests that heteroskedasticity is successfully removed by 
this specification. We now turn to testing for serial independence in the 
AR(12)-GARCH(1, 1) residuals using the BDS procedure, the results of 
which are shown in table 6.4. It is clear that there is significant dependence 
remaining in the series even after filtering out linear and GARCH effects. 
Finally, using the Nonlinear Toolkit software distributed freely by Ashley and 
Patterson, we applied the bicovariance and bispectrum tests, adopting the 
specifications suggested by them. The test statistics were both found to be 
significant at the 5 per cent level, thus indicating some kind of non-linear 
dependence in the data, consistent with the results of the BDS test. 


The choice of unconditional distribution is something that has always 
puzzled academics and practitioners in finance. The standard assumption 
since the 1960s has been that financial prices are geometric Brownian 
motions and, therefore, logarithmic returns follow a normal distribution. 
This assumption has profound implications for a variety of theoretical and 
practical problems in finance, as expected returns and risks in a multivariate 
normal financial world can be fully described probabilistically using just 
means, variances and covariances. Much of finance analysis, such as primary 
and derivative asset pricing, portfolio optimisation and risk management, is 
built upon the assumption of normally distributed returns. The normal 
distribution also has important implications for econometric and statistical 
analysis, since most of the underlying theory is parametric and has been 
developed on the basis of normality. 

The popularity of the normal distribution is easily justifiable from a 
practical point of view because it offers tractability and computational 
simplicity. Furthermore, the normality assumption is supported theoretic- 
ally by the central limit theorem (CLT), which states that the sum of iid 
random variables with finite mean and variance will asymptotically converge 
to a normal distribution. Under these assumptions, the normal distribution 
will offer the best approximation to empirical return distributions in samples 
of reasonable size. An important point to note is that independence of 
returns is not only one of the basic assumptions that leads to the CLT but 
is also consistent with intuitive and theoretical arguments (e.g. Samuelson, 
1965, 1973) that exclude the possibility of ‘free lunches’ in finance. Given 
the advantages of assuming normality and the fact that independence was 
considered to hold reasonably well for a variety of financial return series 
(e.g. see Cootner, 1964, and Fama, 1970), it is not surprising that the normal 
distribution quickly became a standard assumption in finance. The assumption 
of independence, which is closely related to the normal distribution 
under the CLT, is not a necessary or sufficient condition for theoretically 
consistent financial prices, however (under, for example, rational expectations: 
LeRoy, 1973, and Lucas, 1978; noise trading: Frankel and Froot, 1988; 
or incomplete knowledge: Goldberg and Frydman, 1996). 

In tandem with these theoretical developments, empirical research on 
returns distributions has also been ongoing since the early 1960s: see, for 
example, the surveys in Kon (1984), Badrinath and Chatterjee (1988) and 
Mittnik and Rachev (1993a), and the recent book by Rachev, Menn and 
Fabozzi (2005). These have almost universally found that such distributions 
are characterised not by normality but by the ‘stylised facts’ of fat tails, high 
peakedness (excess kurtosis) and skewness. Consequently, there have been 
several recent developments in statistics and econometrics that have led to 
considerable advances in the analysis of empirical returns distributions. 

To set the scene for subsequent analysis, section 1 presents an initial 
descriptive analysis of the distributional properties of two typical return 
series, before section 2 reviews two of the most important theoretical 
models for examining return distributions: the stable process and, much 
more briefly since it was analysed in great detail in chapter 5, the ARCH 
process. Section 3 generalises the discussion to consider tail shapes of 
distributions and methods of estimating indices of these shapes, while 
section 4 reviews existing empirical research and offers new evidence from 
our own returns series. Section 5 considers the implications of fat-tailed 
distributions for testing the conventional maintained assumption of time 
series models of returns, that of weak, or covariance, stationarity. Section 6 
switches attention to modelling the central part of returns distributions, 
and section 7 reviews data-analytic methods of modelling skewness and 
kurtosis. The distributional properties of absolute returns are the focus of 
section 8, and a summary and some further extensions are provided in 
section 9. 


7.1 Descriptive analysis of returns series 


The techniques discussed in this chapter are illustrated using two return 
series: (i) the daily returns of the London FT30 for a sixty-year period from 
1935 to 1994, which has previously been analysed in terms of its 
long-memory properties and the profitability of technical trading rules in Mills 
(1996a, 1997b); and (ii) the daily returns of the S&P 500, which has been 
used in earlier examples. Unlike other chapters, however, because the use 
of these series is integral to the development of the techniques, separate, 
numbered, examples will not be presented; rather, the techniques will be 
illustrated within the main body of the text itself. 


Table 7.1 Descriptive statistics on returns distributions 

           T        Mean    Median   Std.dev.   Max.     Min.     Range   Skew    Kurtosis 
FT30       15,003   0.022   0.000    1.004      10.78    -12.40   23.2    -0.14   14.53 
S&P 500    17,054   0.020   0.047    1.154      16.37    -22.80   38.2    -0.49   26.04 

Descriptive distributional statistics are thus presented in table 7.1 and 
graphical representations of these distributions are shown in figure 7.1. The 
empirical densities shown are computed as a smoothed function of the 
histogram using a normal kernel (see Silverman, 1986, chap. 3). Superim- 
posed on the empirical density is a normal distribution having the same 
variance as that estimated from the sample. 

The empirical cumulative distributions are plotted against the cumulative 
reference normal distributions in the form of normal probability or Q-Q 
plots (see Mills, 1990, chap. 3). From this information it is clear that both 
returns distributions diverge substantially from the normal in the manner 
expected: they have fatter tails, are more highly peaked and are skewed. 
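
The summary measures reported in table 7.1 are simple functions of the sample moments. A minimal sketch of how such statistics might be computed, assuming moments are formed with divisor $T$ and that kurtosis is the raw ratio $m_4/m_2^2$ (so that the normal benchmark is 3); neither convention is stated in the text, and the function name and toy data are hypothetical:

```python
import math


def describe_returns(returns):
    """Moment-based summary statistics of a list of returns.

    Assumptions of this sketch: central moments use divisor T, and kurtosis
    is the raw ratio m4 / m2**2 rather than excess kurtosis.
    """
    n = len(returns)
    mean = sum(returns) / n
    m2 = sum((r - mean) ** 2 for r in returns) / n
    m3 = sum((r - mean) ** 3 for r in returns) / n
    m4 = sum((r - mean) ** 4 for r in returns) / n
    return {
        "T": n,
        "mean": mean,
        "std_dev": math.sqrt(m2),
        "max": max(returns),
        "min": min(returns),
        "range": max(returns) - min(returns),
        "skew": m3 / m2 ** 1.5,
        "kurtosis": m4 / m2 ** 2,
    }


# Hypothetical toy data, not the FT30 or S&P 500 series
toy = [-1.0, 1.0, -1.0, 1.0]
stats = describe_returns(toy)
print(stats["skew"], stats["kurtosis"])
```

On this reading, the negative skewness and kurtosis far above 3 in table 7.1 are the numerical counterparts of the skewed, highly peaked and fat-tailed shapes described above.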


7.2 Two models for returns distributions 


The ‘fat-tailed and highly peaked’ stylised fact about financial return series 
was first emphasised by Mandelbrot (1963a, 1963b), who proposed using the 
stable (also known as the stable Paretian, Pareto-Lévy or Lévy flight) class of 
distributions, which includes the normal as a special case, to model the 
fat-tailed nature of stock returns. Since then, many, but certainly by no means 
all, researchers have found that the stable distribution provides a good fit to a 
wide variety of returns series: see, for example, the references provided by 
Ghose and Kroner (1995). Alternative lines of modelling take the empirical 
returns distribution to be a mixture either of normals or of a normal and a 
stable, or use some other distribution capable of modelling fat tails, such as 
the student-t or the double Weibull distribution. These alternatives will not, 
however, be pursued here, although they have undoubtedly contributed to 
our knowledge of the distributional behaviour of asset returns: see, in par- 
ticular, Mittnik and Rachev (1993a, 1993b), McDonald (1996), and Rachev, 
Menn and Fabozzi (2005) and the references contained therein. 

The symmetric about zero stable class of distributions is characterised by 
two parameters: a scale factor and the characteristic exponent, which indexes 
the distribution. (We restrict attention to symmetric about zero stable 
distributions so that we may more easily focus on the behaviour of the tails of the 
distributions; allowing asymmetry about a non-zero location measure 
introduces two further parameters that merely complicate matters for the purposes 
at hand.) Most attention is focused on the characteristic exponent because, 
since closed-form density functions do not exist for most stable distributions, 
they are usually defined by their characteristic functions, which always exist. 

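Suppose $\{X_t\}$ is a strict white-noise zero-mean process with probability 
distribution $F(X) = P(X_t < X)$. The characteristic function of $X$ is defined as 
the Fourier-Stieltjes transform of $F(X)$ 

$$
\varphi(\varsigma) = \int_{-\infty}^{+\infty} e^{i\varsigma X}\,dF(X)
$$

where $\varsigma$ is real (see Feller, 1966, p. 473). The symmetric (about zero) stable 
characteristic function has the form 

$$
\varphi(\varsigma) = \exp\left(-\sigma^{\alpha}|\varsigma|^{\alpha}\right)
$$

where $0 < \alpha \le 2$ is the characteristic exponent and $\sigma$ is a scale parameter. 
Samorodnitsky and Taqqu (1994) refer to $X$ being S$\alpha$S (symmetric $\alpha$-stable). 
The $N(0,2)$ distribution is obtained when $\alpha = 2$ (taking $\sigma = 1$) and the Cauchy 
distribution is obtained when $\alpha = 1$. Using the Fourier inversion formula provided by 
Feller (1966, p. 482), the stable probability distribution becomes 

$$
F(X) = \frac{1}{2\pi}\int_{-\infty}^{+\infty} \exp\left(-\sigma^{\alpha}|\varsigma|^{\alpha}\right)\exp(-i\varsigma X)\,d\varsigma
     = \frac{1}{\pi}\int_{0}^{+\infty} \exp\left(-\sigma^{\alpha}|\varsigma|^{\alpha}\right)\cos(\varsigma X)\,d\varsigma \qquad (7.1)
$$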


As remarked above, when $\alpha = 2$ the distribution is normal and all 
moments are finite, whereas if $\alpha < 2$ all moments of order greater than $\alpha$ are 
infinite. This property produces the fat-tailed (relative to the normal) behaviour of 
stable distributions. A necessary and sufficient condition for a distribution to 
be fat-tailed is that of regular variation at infinity (Feller, 1966, chap. 8.8). 
The stable $F(X)$ of equation (7.1) can be demonstrated to have the property 

$$
\lim_{s \to \infty} \frac{1 - F(sX)}{1 - F(s)} = X^{-\alpha}
$$

which implies that the stable distribution displays a power-declining tail, 
$X^{-\alpha}$, rather than an exponential decline, as is the case with the normal. It is 
in this sense that $\alpha$ is also often referred to as the tail index. 
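
Because the stable law is defined through its characteristic function, values of the inverse transform in (7.1) generally have to be obtained numerically. A rough sketch, assuming a plain trapezoidal rule on a truncated range; the function name, the truncation point and the number of grid points are choices made here, not part of the text, and the scheme will be inaccurate for small $\alpha$ combined with large $|x|$, where the integrand oscillates rapidly. The integral evaluates the density of the S$\alpha$S law, so the two closed-form cases ($\alpha = 1$, Cauchy; $\alpha = 2$ with $\sigma = 1$, $N(0,2)$) provide a check:

```python
import math


def sas_density(x, alpha, sigma=1.0, n=20000):
    """Trapezoidal approximation to (1/pi) * int_0^inf exp(-(sigma*u)**alpha) cos(u*x) du."""
    # Truncate where the envelope exp(-(sigma*u)**alpha) is about exp(-37),
    # i.e. negligible in double precision (an assumption of this sketch).
    upper = (37.0 ** (1.0 / alpha)) / sigma
    h = upper / n
    total = 0.5 * (1.0 + math.exp(-(sigma * upper) ** alpha) * math.cos(upper * x))
    for i in range(1, n):
        u = i * h
        total += math.exp(-(sigma * u) ** alpha) * math.cos(u * x)
    return total * h / math.pi


# Closed-form benchmarks
cauchy_exact = 1.0 / (math.pi * (1.0 + 2.0 ** 2))                # alpha = 1 at x = 2
normal_exact = math.exp(-1.0 / 4.0) / math.sqrt(4.0 * math.pi)   # N(0, 2) at x = 1

print(sas_density(2.0, alpha=1.0), cauchy_exact)
print(sas_density(1.0, alpha=2.0), normal_exact)
```

For serious work a dedicated routine for stable laws would be preferable; the point here is only that the characteristic function pins down the distribution even when no closed form exists.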

Apart from its empirical usefulness, is there any theoretical justification as 
to why the stable distribution should be an appropriate generating process 
for financial data? Mandelbrot (1963b, sect. IID) argues that such a justifi- 
cation arises from a generalisation of the CLT. This establishes that if the 
limiting distribution of an appropriately scaled sum of iid random variables 
exists then it must be a member of the stable class, even if these random 
variables have infinite variance. It thus differs from the usual CLT, which says 
that if the second moments are finite then the limiting distribution is normal. 

This result, for which a proof may be found in Feller (1966), generalises 
the moment requirements of the CLT and thus expands the set of limiting 
distributions. More importantly for our purposes, it also implies that if daily 
returns, say, follow a stable distribution then, since weekly, monthly and 
quarterly returns, for example, can be viewed as the sum of daily returns, 
they too will follow stable distributions having identical characteristic 
exponents. This is known as the stability or invariance under addition 
property of stable distributions. For more detailed technical discussion of 
stable distributions, see, for example, Mandelbrot (1963a, 1963b), Feller 
(1966), Brockwell and Davis (1996), Mittnik and Rachev (1993a, 1993b), 
Samorodnitsky and Taqqu (1994) and Rachev, Menn and Fabozzi (2005). 
These references also deal with the correlation of stable processes over time; 
moving averages of a stable random variable are also stable, as long as certain 
conditions on the coefficients are satisfied. We are therefore not restricted to 
analysing uncorrelated series, and correlated series can be filtered in the usual 
ways — e.g. by fitting autoregressions. 
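
The invariance under addition property can be verified directly from the symmetric stable characteristic function. A short derivation, assuming $k$ iid S$\alpha$S variables $X_1, \ldots, X_k$ with common scale $\sigma$ (the sum $S_k$ and the symbol $k$ are introduced here purely for illustration):

```latex
\begin{aligned}
S_k &= X_1 + X_2 + \cdots + X_k, \\
\varphi_{S_k}(\varsigma)
  &= \left[\exp\left(-\sigma^{\alpha}|\varsigma|^{\alpha}\right)\right]^{k}
   = \exp\left(-k\,\sigma^{\alpha}|\varsigma|^{\alpha}\right)
   = \exp\left(-\left(k^{1/\alpha}\sigma\right)^{\alpha}|\varsigma|^{\alpha}\right).
\end{aligned}
```

The sum is therefore again S$\alpha$S with the same characteristic exponent $\alpha$ and scale $k^{1/\alpha}\sigma$; for $\alpha = 2$ this is the familiar $\sqrt{k}$ scaling of the normal case.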

Correlated stable variables may thus be able to explain a second stylised 
fact of returns: the volatility clustering that is so prevalent in financial data. 
The GARCH class of models discussed in chapter 5, section 5, can also, of 
course, exhibit volatility clustering — i.e. serial correlation of conditional 
variances. For example, the simple ‘ARCH(1) with normal innovations’ 
process for $X_t$ is 

$$
X_t = U_t\sigma_t \qquad (7.2)
$$

where $U_t \sim NID(0,1)$ and 

$$
\sigma_t^2 = \omega + \beta X_{t-1}^2 \qquad (7.3)
$$

Equations (7.2) and (7.3) can be written as 

$$
X_t^2 = \omega U_t^2 + \beta U_t^2 X_{t-1}^2 = B_t + A_t X_{t-1}^2 \qquad (7.4)
$$


say, thus demonstrating the volatility clustering property ($X_t$ is serially 
uncorrelated but is not independent). The ARCH(1) process may also exhibit 
fat tails. De Haan et al. (1989) show that the $X_t$ of (7.4) regularly varies at 
infinity and has a tail index $\zeta$ defined implicitly by the equation 

$$
\Gamma\left(\frac{\zeta + 1}{2}\right) = \pi^{1/2}(2\beta)^{-\zeta/2}
$$

where $\Gamma(\cdot)$ is the gamma function, as long as, amongst other things, 
$\beta < \bar{\beta} = 2e^{\gamma} \approx 3.5621$, where $\gamma$ is Euler’s constant. From Groenendijk, 
Lucas and de Vries (1995, fig. 1) we have it that $\zeta = 2$ at $\beta = 1$, $\zeta = \infty$ at 
$\beta = 0$ and $\zeta = 0.00279$ at $\beta = \bar{\beta}$. It thus follows that, in terms of tail 
behaviour, the stable and ARCH models partially overlap. At $\beta = 0$ and 1 
($\zeta = \infty$ and 2) the two models have normal tails, while for $1 < \beta < \bar{\beta}$ the 
tail indices can be equal. For $0 < \beta < 1$, $\zeta > 2$, however, $X_t$ is covariance 
stationary, has finite variance and there is no stable counterpart, whereas 
for $\zeta < 0.00279$ there is no ARCH counterpart. Tail behaviour can therefore 
discriminate between the two classes of models in the regions only where 
they do not overlap. 
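
Since the ARCH(1) tail index is defined only implicitly, it has to be found numerically for a given $\beta$. A minimal sketch, assuming the condition takes the form $\Gamma((\zeta+1)/2) = \pi^{1/2}(2\beta)^{-\zeta/2}$ and solving it in logarithms by bisection; the function name, tolerance and bracketing strategy are choices made here, and the simple bracketing is unreliable when $\beta$ is very close to the upper bound $2e^{\gamma}$:

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler's constant


def arch1_tail_index(beta, tol=1e-10):
    """Positive root zeta of Gamma((zeta+1)/2) = sqrt(pi) * (2*beta)**(-zeta/2)."""
    beta_bar = 2.0 * math.exp(EULER_GAMMA)
    if not 0.0 < beta < beta_bar:
        raise ValueError("beta must lie strictly between 0 and 2*exp(gamma)")

    def g(zeta):
        # log of Gamma((zeta+1)/2) * (2*beta)**(zeta/2) / sqrt(pi);
        # negative below the tail index, positive above it
        return (math.lgamma((zeta + 1.0) / 2.0)
                + 0.5 * zeta * math.log(2.0 * beta)
                - 0.5 * math.log(math.pi))

    hi = 1.0
    while g(hi) <= 0.0:          # move right until g is positive
        hi *= 2.0
    lo = hi
    while g(lo) >= 0.0:          # move left until g is negative
        lo /= 2.0
        if lo < 1e-8:
            raise ArithmeticError("beta too close to the upper bound for this solver")
    while hi - lo > tol:         # plain bisection on the sign change
        mid = 0.5 * (lo + hi)
        if g(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


for b in (0.5, 1.0, 2.0):
    print(b, arch1_tail_index(b))
```

The case $\beta = 1$ reproduces $\zeta = 2$, while values of $\beta$ below and above one give tail indices above and below two respectively, matching the regions described in the text.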

Although stable distributions have many desirable properties, results 
regarding their empirical appropriateness for describing financial returns 
have been conflicting (see Ghose and Kroner, 1995, and Baillie, 1996). 
Overall, it seems that any supporting evidence fades away as the sampling 
interval of returns increases. Another inconsistency is that estimates of 
variance appear to converge instead of being infinite. Nevertheless, the 
reluctance to replace the normal distribution by the stable family has not 
been based only on empirical or theoretical criteria but also, and perhaps 
most importantly, on the basis of practical convenience. This is because 
stable distributions bring about acute mathematical problems — e.g. they 
have no simple analytical representation, no theory exists for mixing stable 
distributions with different characteristics, distribution parameters are 
notoriously difficult to estimate, standard asymptotic theory is inapplicable, 
etc. Furthermore, the infinite variance property essentially prohibits direct 
applications within the framework of a standard finance theory that requires 
finite second and often higher moments. 


7.3 Determining the tail shape of a returns distribution 


If $\alpha < 2$ then, through the property of regular variation at infinity, the tails of 
a stable process are a function of $\alpha$ and display a power decline. In fact, they 
follow an asymptotic Pareto distribution, so that 

$$
P(X > x) = P(X < -x) = C^{\alpha}x^{-\alpha}, \qquad x > 0
$$

where $C$ is a finite and positive parameter measuring dispersion. As we have seen 
for the GARCH process, however, the tail index $\zeta$ may be defined for 
distributions other than the stable, and for these the index will not equal the 
characteristic exponent, although it will determine the maximal finite exponent, i.e. 
the tail index $\zeta$ is such that $E|X|^k < \infty$ for all $0 \le k < \zeta$. If $\zeta < 2$ then the 
variance of $X$ is infinite and $X$ may be characterised as being generated by a stable 
distribution for which $\alpha = \zeta$. If $\zeta > 2$ the variance of $X$ is finite, but the 
distribution is not necessarily normal and may thus still have fat tails: for example, it 
may be student-t, in which case $\zeta$ defines the degrees of freedom. Distributions 
such as the normal and the power exponential possess all moments, and for 
these $\zeta$ is infinite, and they may be described as being thin-tailed. 

For fat-tailed distributions other than the stable, and which also have the 
property of regular variation at infinity, tail behaviour will also be 
asymptotically Pareto (this will typically be the case for return distributions: see the 
arguments in Koedijk, Schafgans and de Vries, 1990, for example). Loretan 
and Phillips (1994) formalise this by defining the tail behaviour of the 
distribution of $X_t$ to take the form 

$$
P(X > x) = C^{\zeta}x^{-\zeta}\left(1 + \eta_R(x)\right), \qquad x > 0
$$
$$
P(X < -x) = C^{\zeta}x^{-\zeta}\left(1 + \eta_L(x)\right), \qquad x > 0
$$

where $\eta_i \to 0$ as $x \to \infty$, $i = R, L$. The parameters $C$ and $\zeta$ can be estimated 
using order statistics. If $X_{(1)} \le X_{(2)} \le \ldots \le X_{(T)}$ are the order statistics of 
$\{X_t\}_1^T$ in ascending order, then $\zeta$ can be estimated by 


$$
\hat{\zeta} = \left(s^{-1}\sum_{j=1}^{s}\log X_{(T-j+1)} - \log X_{(T-s)}\right)^{-1}
            = \left(s^{-1}\sum_{j=1}^{s}\left(\log X_{(T-j+1)} - \log X_{(T-s)}\right)\right)^{-1}
$$

$\hat{\zeta}$ is related to the simpler estimator proposed by de Haan and Resnick (1980) 
and modified by Groenendijk, Lucas and de Vries (1995), being 
approximately a weighted average of their estimator 

$$
\left(\frac{\log X_{(T-j+1)} - \log X_{(T-s)}}{\log(s/j)}\right)^{-1}
$$

evaluated at different values of $j < s$. An estimate of the scale dispersion 
parameter is 

$$
\hat{C} = (s/T)\,X_{(T-s)}^{\hat{\zeta}}
$$


Hill (1975) is the original reference for these estimators, which are 
conditional ML estimators, while Hall (1982) provides their asymptotic theory. 
To make these estimators operational, the order statistic truncation number 
$s = s(T)$ must be selected. Although we require that $s(T) \to \infty$ as $T \to \infty$, 
various approaches have been taken in empirical applications with a finite 
sample. Typically $\hat{\zeta}$ is computed for different values of $s$, selecting an $s$ in the 
region over which $\hat{\zeta}$ is more or less constant. Koedijk, Schafgans and de Vries 
(1990) use Monte Carlo simulation to choose $s$ such that the MSE of $\hat{\zeta}$ is 
minimised, while Loretan and Phillips (1994) suggest that $s$ should not 
exceed $0.1T$. 
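
The order-statistic estimator of the tail index is straightforward to compute once $s$ has been chosen. A minimal sketch for the right tail, assuming the $s$ largest observations and the threshold $X_{(T-s)}$ are all positive; the function name is hypothetical, the standard error $\hat{\zeta}/s^{1/2}$ is the usual asymptotic one, and the simulated Pareto sample (true index 3) is purely illustrative and is not one of the return series:

```python
import math
import random


def hill_estimate(data, s):
    """Hill estimate of the right-tail index from the s largest observations.

    Returns (zeta_hat, std_err), where std_err = zeta_hat / sqrt(s).
    """
    x = sorted(data)
    T = len(x)
    if not 0 < s < T:
        raise ValueError("need 0 < s < T")
    threshold = x[T - s - 1]  # X_(T-s) in 1-based order-statistic notation
    if threshold <= 0:
        raise ValueError("X_(T-s) must be positive to take logarithms")
    # x[T - j] is X_(T-j+1), the j-th largest observation
    mean_log_excess = sum(math.log(x[T - j] / threshold) for j in range(1, s + 1)) / s
    zeta_hat = 1.0 / mean_log_excess
    return zeta_hat, zeta_hat / math.sqrt(s)


# Illustrative data only: exact Pareto draws with tail index 3
random.seed(0)
sample = [(1.0 - random.random()) ** (-1.0 / 3.0) for _ in range(20000)]
for s in (250, 500, 1000, 2000):
    print(s, hill_estimate(sample, s))
```

Tabulating the estimate over a range of $s$, as in the loop, is the informal device mentioned above for locating a region in which $\hat{\zeta}$ is more or less constant. The left tail can be handled by passing the negated series.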

Phillips, McFarland and McMahon (1996), following Hall and Welsh 
(1985), deduce an ‘optimal’ choice of $s(T)$ using the asymptotic theory of 
Hall (1982), from which the MSE of the limit distribution of $\hat{\zeta}$ is minimised 
by choosing $s(T) = [\lambda T^{2/3}]$, where $[\,\cdot\,]$ signifies the integer part of its 
argument, and where $\lambda$ is estimated adaptively by 

$$
\hat{\lambda} = \left|\frac{\hat{\zeta}_1}{2^{1/2}}\left(\frac{T}{s_2}\right)\left(\hat{\zeta}_1 - \hat{\zeta}_2\right)\right|^{2/3}
$$

Here $\hat{\zeta}_1$ and $\hat{\zeta}_2$ are preliminary estimates of $\zeta$ using data truncations $s_1 = [T^{\sigma}]$ 
and $s_2 = [T^{\tau}]$, respectively, where $0 < \sigma < 2/3 < \tau < 1$. Phillips, McFarland 
and McMahon (1996) recommend setting $\sigma = 0.6$ and $\tau = 0.9$. Note that, as 
defined, these estimates pertain to the right or upper tail of the distribution 
of $X$; to estimate the parameters of the left or lower tail, we simply multiply 
the order statistics by $-1$ and repeat the calculations. We can also estimate a 
single pair of $\zeta$ and $C$ estimates by redoing the calculations with absolute 
values of the order statistics. 

Confidence intervals and hypothesis tests for $\hat{\zeta}$ and $\hat{C}$ can be calculated 
using the results, from Hall (1982), that asymptotically 

$$
s^{1/2}\left(\hat{\zeta} - \zeta\right) \sim N\left(0, \zeta^2\right)
$$

and 

$$
s^{1/2}\left(\ln(T/s)\right)^{-1}\left(\hat{C} - C\right) \sim N\left(0, C^2\right)
$$

A hypothesis of particular interest is that of $H_0: \zeta < 2$ against the alternative 
$H_1: \zeta \ge 2$, since, from the parameter’s definition, $\zeta = 2$ divides off finite 
variance distributions (e.g. the student-t and the ARCH process) from 
infinite variance distributions. 
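
The adaptive truncation rule and the asymptotic standard error can be combined in a few lines. A sketch under stated assumptions: the preliminary truncations are $s_1 = [T^{0.6}]$ and $s_2 = [T^{0.9}]$, the threshold order statistics are positive (which may fail for $s_2$ on series with many non-positive values), and the final clamp on $\hat{s}$ is a safeguard added here rather than part of the procedure. All function names are hypothetical and the Pareto sample is illustrative only:

```python
import math
import random


def hill(sorted_x, s):
    """Hill tail-index estimate from the s largest values of an ascending list."""
    T = len(sorted_x)
    threshold = sorted_x[T - s - 1]  # X_(T-s)
    return s / sum(math.log(sorted_x[T - j] / threshold) for j in range(1, s + 1))


def adaptive_lambda(zeta1, zeta2, T, s2):
    """lambda-hat = |zeta1 / sqrt(2) * (T / s2) * (zeta1 - zeta2)| ** (2/3)."""
    return abs(zeta1 / math.sqrt(2.0) * (T / s2) * (zeta1 - zeta2)) ** (2.0 / 3.0)


def adaptive_truncation(data, sigma=0.6, tau=0.9):
    """Return (s_hat, zeta_hat, std_err) using s_hat = [lambda-hat * T**(2/3)]."""
    x = sorted(data)
    T = len(x)
    s1, s2 = int(T ** sigma), int(T ** tau)
    lam = adaptive_lambda(hill(x, s1), hill(x, s2), T, s2)
    s_hat = int(lam * T ** (2.0 / 3.0))
    s_hat = max(2, min(s_hat, T - 2))  # safeguard, an assumption of this sketch
    zeta_hat = hill(x, s_hat)
    return s_hat, zeta_hat, zeta_hat / math.sqrt(s_hat)


# Illustrative data only: exact Pareto draws with tail index 3
random.seed(0)
sample = [(1.0 - random.random()) ** (-1.0 / 3.0) for _ in range(5000)]
print(adaptive_truncation(sample))
```

An approximate 95 per cent interval for $\zeta$ then follows as $\hat{\zeta} \pm 1.96\,\hat{\zeta}/\hat{s}^{1/2}$, and comparing its lower end with two gives an informal reading of the finite-variance hypothesis.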

Constancy of the estimated tail indexes can be examined by using the 
following useful result. Suppose that we obtain estimates $\hat{\zeta}^{(1)}$ and $\hat{\zeta}^{(2)}$ from 
two independent samples. The statistic 

$$
Q = \left(\frac{\zeta^{(1)}}{\hat{\zeta}^{(1)}} - 1\right)^2 s_1 + \left(\frac{\zeta^{(2)}}{\hat{\zeta}^{(2)}} - 1\right)^2 s_2
$$

where $\zeta^{(1)}$ and $\zeta^{(2)}$ are hypothesised values of the tail index in the two 
samples, is then asymptotically distributed as $\chi^2_2$. Thus, constancy of the tail 
index can be assessed in the following way. Suppose the null hypothesis is 
$H_{0,\zeta}: \zeta^{(1)} = \zeta^{(2)} = \zeta$ and we wish to test at the 5 per cent significance level. 
Solving the quadratic equation 

$$
\left(\frac{\zeta}{\hat{\zeta}^{(1)}} - 1\right)^2 s_1 + \left(\frac{\zeta}{\hat{\zeta}^{(2)}} - 1\right)^2 s_2 - 5.99 = 0
$$

will then provide the upper and lower bounds for the tail indices that are 
consistent with the null. 

An alternative parameter constancy test is proposed by Loretan and 
Phillips (1994). If $\hat{\tau}_{\zeta} = \hat{\zeta}^{(1)} - \hat{\zeta}^{(2)}$ then the statistic 

$$
V_{\zeta} = \frac{\hat{\tau}_{\zeta}^{2}}{\dfrac{\left(\hat{\zeta}^{(1)}\right)^2}{s_1} + \dfrac{\left(\hat{\zeta}^{(2)}\right)^2}{s_2}}
$$

is asymptotically distributed as $\chi^2_1$. A similar statistic is available to test 
$H_{0,C}: C^{(1)} = C^{(2)} = C$ using $\hat{\tau}_C = \hat{C}^{(1)} - \hat{C}^{(2)}$, and these can be used to assess 
whether the parameters are equal across the right and left tails of the 
distribution as well as across time periods. 
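
The Loretan and Phillips statistic needs only the two tail index estimates and their truncation numbers. A minimal sketch, assuming the form $V_{\zeta} = \hat{\tau}_{\zeta}^2 / \left((\hat{\zeta}^{(1)})^2/s_1 + (\hat{\zeta}^{(2)})^2/s_2\right)$ and using the 5 per cent critical value of $\chi^2_1$, 3.841; the function name is hypothetical, and the inputs are the FT30 left-tail half-sample estimates reported later in table 7.3:

```python
CHI2_1_CRIT_5PCT = 3.841  # 5 per cent critical value of chi-squared with 1 d.f.


def tail_index_constancy_stat(zeta1_hat, s1, zeta2_hat, s2):
    """V statistic for equality of two independently estimated tail indices."""
    tau = zeta1_hat - zeta2_hat
    return tau ** 2 / (zeta1_hat ** 2 / s1 + zeta2_hat ** 2 / s2)


# FT30 left tail, first and second halves of the sample (table 7.3)
v = tail_index_constancy_stat(2.78, 200, 2.99, 201)
print(round(v, 2), v > CHI2_1_CRIT_5PCT)  # prints 0.53 False
```

The value agrees with the 0.53 reported in table 7.3, so constancy of the FT30 left tail index across the two halves is not rejected.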

There is some evidence, however, provided by McCulloch (1997), that $\hat{\zeta}$ is 
an upwardly biased estimate of the true value $\zeta$ when the distribution really 
is stable, so these testing procedures should be used with considerable care. 

Given an estimate of the tail index $\zeta$, extreme return levels that are only 
rarely exceeded can be established by extrapolating the empirical distribution 
function outside the sample domain, and this can be useful for analysing 
‘safety first’ portfolio selection strategies (see Jansen and de Vries, 1991, and 
de Haan et al., 1994). A consistent estimate of the ‘excess level’ $\hat{x}_p$, for which 

$$
P\left(X_1 \le \hat{x}_p, X_2 \le \hat{x}_p, \ldots, X_k \le \hat{x}_p\right) = 1 - p
$$

for small $p$ and given $k$, is given by 

$$
\hat{x}_p = \frac{(kr/pT)^{\hat{\gamma}} - 1}{1 - 2^{-\hat{\gamma}}}\left(X_{(T-r)} - X_{(T-2r)}\right) + X_{(T-r)} \qquad (7.5)
$$

where $\hat{\gamma} = \hat{\zeta}^{-1}$, $r = s/2$, $k$ is the time period considered and $p$ is the 
‘probability of excess’ (see Dekkers and de Haan, 1989). This equation can be 
‘inverted’ to obtain the probability $p$ of sustaining a loss of $x_p$. 

7.4 Empirical evidence on tail indices 


A number of papers have investigated the tail behaviour of the empirical 
distribution of foreign exchange rate returns, and, as well as assessing how 
fat-tailed returns are, they also investigate the stability of the distributions 
across different regimes (see Koedijk, Schafgans and de Vries, 1990, Hols and 
de Vries, 1991, Koedijk and Kool, 1992, Koedijk, Stork and de Vries, 1992, 
and Loretan and Phillips, 1994). The general finding from these papers is that 
exchange rate returns are fat-tailed but with $\zeta < 4$ and, during a variety of 
fixed exchange rate regimes, have tail indices that are in the region $1 < \zeta < 2$. 
For floating rate regimes, however, $\zeta$ tends to exceed two, which is 
interpreted as suggesting that a float lets exchange rates adjust more smoothly 
than regimes that involve some amount of fixity. It would also appear that $\zeta$ 
is stable across tails. 

Jansen and de Vries (1991), Loretan and Phillips (1994) and de Haan et al. 
(1994) estimate tail indices for US stock and bond market returns, finding 
that estimates lie in the region $2 < \zeta < 4$, so that, although the distributions 
are fat-tailed, they appear to be characterised by finite variances. Again, 
estimates of $\zeta$ for stock returns are stable across tails and across subperiods. 
As McCulloch (1997) points out, though, even $\hat{\zeta}$ values well in excess of two 
could still be consistent with true values of $\zeta$ less than two, so these results, 
while confirming the fat-tailed nature of the returns distributions, cannot be 
interpreted as conclusively ruling out infinite variances. 

Estimates of tail indices for our series are shown in table 7.2. Both return 
distributions have estimated tail indices (for the ‘optimal’ setting of the 
truncation lag) lying in the region $2 < \zeta < 4$, with the left tail indices usually 
being a little smaller than the right, although not significantly so on the basis 
of the $V_{\zeta}$ test for constancy across tails (not reported). 


Table 7.2 Point estimates of tail indices 

FT30 
  $s$         Left tail              Right tail             Both tails 
  25          3.167 (0.633)          3.598 (0.720)          4.377 (0.875) 
  50          3.138 (0.444)          2.847 (0.403)          3.253 (0.460) 
  75          3.135 (0.362)          3.028 (0.350)          3.357 (0.385) 
  100         3.305 (0.330)          3.113 (0.311)          3.082 (0.308) 
  320         2.937 (0.164)          2.922 (0.163)          3.111 (0.174) 
  $\hat{s}$   2.887[298] (0.345)     2.918[317] (0.277)     3.024[405] (0.150) 

S&P 500 
  $s$         Left tail              Right tail             Both tails 
  25          3.192 (0.638)          4.272 (0.854)          4.445 (0.889) 
  50          3.983 (0.563)          3.062 (0.433)          3.917 (0.554) 
  75          3.269 (0.373)          3.246 (0.375)          3.672 (0.424) 
  100         2.966 (0.297)          3.040 (0.304)          3.554 (0.355) 
  320         2.809 (0.157)          2.625 (0.147)          2.925 (0.163) 
  $\hat{s}$   2.749[335] (0.150)     2.574[365] (0.135)     2.783[474] (0.128) 

Note: $\hat{s}$ is the optimal estimate of $s$ using $\sigma = 0.6$ and $\tau = 0.9$. Actual value of $\hat{s}$ 
reported in [ ] in each column. Standard errors are shown in parentheses. 

Figure 7.2 plots the left tail shapes of the empirical distribution functions 
of the returns in double-logarithmic coordinates, i.e. it plots $\log_{10}(P(X < -x))$ 
against $\log_{10} x$ for $x > 0$. In these coordinates the Pareto distribution, 
for which $P(X < -x) = Dx^{-\zeta}$, where $D = C^{\zeta}$, appears as a straight line with a 
slope of $-\zeta$. Straight lines of slopes $-2$ and $-4$ are plotted against the 
empirical tails to facilitate comparison, the former line because it divides off 
finite variance distributions from ones with infinite variance, the latter line 
because the value $\zeta = 4$ is an important dividing point when testing whether 
a series is covariance stationary, as is discussed in the forthcoming section. 
These again make it clear that the return distributions are certainly fat-tailed, 
but there appears to be little support for them following a stable distribution 
and thus having an infinite variance. 


Figure 7.2 Tail shapes of return distributions 


Table 7.3 Tail index stability tests 

                      First half                  Second half 
                      $\hat{\zeta}$   $\hat{s}$     $\hat{\zeta}$   $\hat{s}$     $V_{\zeta}$ 
FT30       Left       2.78      200         2.99      201          0.53 
           Right      2.97      200         3.09      203          0.16 
S&P 500    Left       3.09      207         3.35      208          0.64 
           Right      2.48      236         3.48      219         12.41 


Table 7.4 Lower tail probabilities 

FT30 
Return      Probability      Return      Probability 
-0.200      0.00906          -0.193      0.0100 
-0.300      0.00280          -0.246      0.0050 
-0.400      0.00122          -0.428      0.0010 
-0.500      0.00064          -0.950      0.0001 

Note: Calculated using $k = 260$, $\hat{s} = 298$, $\hat{\zeta} = 2.887$. 

S&P 500 
Return      Probability      Return      Probability 
-0.200      0.02019          -0.258      0.0100 
-0.300      0.00665          -0.333      0.0050 
-0.400      0.00302          -0.598      0.0010 
-0.500      0.00164          -1.383      0.0001 

Note: Calculated using $k = 260$, $\hat{s} = 335$, $\hat{\zeta} = 2.749$. 

Stability of the tail indices was examined by splitting the sample periods in 
half and computing the $V_{\zeta}$ statistics. These are shown in table 7.3, along with 
subsample estimates of the tail indices. Only for the right tail of the S&P 500 
distribution is there strong evidence of non-constancy, and there is certainly 
no evidence of the subperiod estimates coming close to two. 

Estimates of extreme levels can also be calculated using the formula (7.5), 
and some calculations are shown in table 7.4. For example, the probability 
that within a given year the FT30 will experience a one-day fall of more than 
20 per cent is 0.009 (i.e. about once in every 110 years), but for the S&P 500 
this probability is 0.02, about once every fifty years. 


7.5 Testing for covariance stationarity 


As we have seen, the assumption of covariance stationarity, that the 
unconditional variance and covariances do not depend on time, is central to 
much of time series econometrics. This assumed constancy of second 
moments is, however, rarely implied by models of optimising behaviour, 
which are typically formulated in terms of restrictions on the conditional 
moments, as in the efficient markets hypothesis, or in terms of relationships 
between conditional moments, as in the CAPM. In financial markets we 
might reasonably expect that unconditional second moments would not 
remain constant over long periods of time; for example, information and 
technology are subject to temporal evolution and can be hypothesised to 
affect the unconditional variance of assets. 

Nonetheless, the assumption of covariance stationarity is a convenient one
to make and is frequently employed. For example, a GARCH process $X_t$ will
be covariance stationary as long as certain conditions are met on the
specification of the conditional variance. As we demonstrated in chapter 5,
for the ARCH(1) process the coefficient on the lagged squared innovation
must be less than one, while general conditions for a GARCH process are
given in Bougerol and Picard (1992). Notwithstanding the wide popularity of
GARCH models, however, considerable empirical evidence has been
accumulated to suggest that unconditional second moments of returns data
tend not to be constant, thus throwing into doubt the assumption of
covariance stationarity.

Mandelbrot (1963a), in arguing that returns have infinite unconditional
variance, proposes examining the recursive estimates

\[ \hat{\mu}_{2,t} = t^{-1} \sum_{j=1}^{t} X_j^2, \qquad t = 1, 2, \ldots, T \]

If $\hat{\mu}_{2,t}$ converges to a constant as $T$ increases, covariance
stationarity would seem to be a reasonable assumption, whereas if it wanders
around then an infinite variance might be suggested (see also Granger and
Orr, 1972). Pagan and Schwert (1990) remark that this idea is equivalent to
the cumulative sum of squares test of Brown, Durbin and Evans (1975), but
they point out that it assumes that the maintained distribution is normal,
which is obviously inappropriate when dealing with series of returns. Pagan
and Schwert thus propose using

\[ \psi(r) = (T\hat{\nu}^2)^{-1/2} \sum_{j=1}^{[Tr]} (X_j^2 - \hat{\mu}_{2,T}) \]

where $0 < r < 1$, $[Tr]$ is the integer part of $Tr$ and

\[ \hat{\nu}^2 = \hat{\gamma}_0 + 2\sum_{j=1}^{l} (1 - j/(l+1))\hat{\gamma}_j \]

is a kernel-based estimate of the ‘long-run’ variance of $X_t^2$, using the
covariances $\hat{\gamma}_0, \ldots, \hat{\gamma}_l$ of the series.
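
As an illustration, the statistic might be computed along the following
lines. This is a minimal sketch rather than the code used for the results
reported here: the function name `cusum_of_squares`, the use of NumPy, the
argument `lags` (playing the role of the truncation lag $l$) and the
division of the sample autocovariances by $T$ are all assumptions made for
the example.

```python
import numpy as np

def cusum_of_squares(x, lags):
    """Studentised cumulative sum of squares, evaluated at r = 1/T, ..., 1."""
    x = np.asarray(x, dtype=float)
    T = x.size
    sq = x ** 2
    dev = sq - sq.mean()                       # X_j^2 minus the full-sample mean
    # Sample autocovariances of the squared series at lags 0, ..., lags
    gamma = np.array([dev[j:] @ dev[:T - j] / T for j in range(lags + 1)])
    # Bartlett weights 1 - j/(l + 1) give the kernel 'long-run' variance
    weights = 1.0 - np.arange(1, lags + 1) / (lags + 1.0)
    nu2 = gamma[0] + 2.0 * np.sum(weights * gamma[1:])
    return np.cumsum(dev) / np.sqrt(T * nu2)
```

The last element of the returned sequence is zero by construction, which is
the ‘tied-down’ property; the maximum, the minimum and their difference
give scalar summaries of the sequence.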

This statistic is a studentised version of the cumulative sum of squares
statistic, since it standardises the partial sums by a sample-based estimate of
$\nu^2$ rather than its expected value under normality. Inference about $\psi(r)$
depends crucially on the value taken by the tail index $\zeta$ of the distribution of
$X_t$. For $\zeta > 4$ and $T \to \infty$, Loretan and Phillips (1994) show that $\psi(r)$
converges weakly to a Brownian bridge (a tied-down Brownian motion: see
chapter 3, section 4), making the probability that $\psi(r) < c$ equal to the
probability that a $N(0, r(1-r))$ random variable is less than $c$.
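
The normal limit makes pointwise critical values easy to check when the
tail index exceeds four. The short sketch below, which uses only the Python
standard library and a hypothetical helper name `bridge_critical_value`,
computes the one-sided critical value of a $N(0, r(1-r))$ variable; it
reproduces the figure of 0.49 quoted in the text for $r = 0.9$ and $r = 0.1$.

```python
from statistics import NormalDist

def bridge_critical_value(r, level=0.05):
    """Upper critical value of a N(0, r(1 - r)) variable at the given level."""
    sd = (r * (1.0 - r)) ** 0.5
    return NormalDist(0.0, sd).inv_cdf(1.0 - level)

print(round(bridge_critical_value(0.9), 2))  # 0.49
```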

For $\zeta < 4$, however, $\psi(r)$ converges to a standardised, tied-down stable
process. Critical values thus depend in a complicated fashion on $\zeta$, and are
tabulated in Loretan and Phillips (1994, table 2). For example, for $\zeta > 4$ the
5 per cent critical value of $\psi(0.9)$ is 0.49, whereas for $\zeta = 2.1$ it is 0.27;
by contrast, while the $\zeta > 4$ 5 per cent critical value of $\psi(0.1)$ is also 0.49,
because of the symmetry of the limit distribution, for $\zeta = 2.1$ it is 0.66.
Moreover, the test has decreasing power as $\zeta$ tends to two from above, since
its rate of divergence from the null becomes much slower because of the
presence of an increasing number of outliers. For $\zeta < 2$ the test is inconsistent,
which is hardly surprising, as in this case variances are infinite anyway.

The entire sequence of $\psi(r)$ values may also be investigated by
considering scalar-valued test statistics, for example $\sup_r(\psi(r))$,
$\inf_r(\psi(r))$ and $R = \sup_r(\psi(r)) - \inf_r(\psi(r))$, the latter in fact
being identical to Lo’s (1991) modified rescaled range statistic discussed in
chapter 4, section 3. Again, critical values for these statistics are provided
in Loretan and Phillips (1994). While we have assumed throughout this
section that $X_t$ is strict white noise or, more generally, that it may be
generated as the iid innovations from an autoregression of an observed
series, the above propositions do not depend crucially on this assumption,
which may be relaxed considerably. What matters for the purposes of testing
for constancy of the unconditional variance is the value taken by the
maximal finite moment of $X_t$, $\zeta$, and in particular whether it exceeds
four or not.

Figure 7.3 Cumulative sum of squares plots

Figure 7.3 provides plots of the $\psi(r)$ sequences for the return series. As has
been found, both series have tail indices in the range $2 < \zeta < 4$. The 95 per
cent critical values for $\sup_r(\psi(r))$ decrease from 1.224 for $\zeta > 4$ to 0.98
for $\zeta = 2.1$, the negatives of these values being the corresponding critical
values for $\inf_r(\psi(r))$. Appropriate critical values are superimposed on each
of the plots. Before computing the statistics, however, the return series were
pre-filtered using an autoregression to remove a non-zero mean and any
serial correlation. Table 7.5 shows values of $\sup_r(\psi(r))$, $\inf_r(\psi(r))$ and
the range statistic $R$, and these confirm the evidence against covariance
stationarity.

Table 7.5 Cumulative sum of squares tests of covariance stationarity

             $\sup_r(\psi(r))$     $\inf_r(\psi(r))$     $R$
FT30         0.39                  -6.44*                6.83*
S&P 500      6.60*                 -0.35                 6.95*

Note: * denotes significance at 1 per cent level for all values of $\zeta > 2$.
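
For the $\zeta > 4$ case, the figure of 1.224 can be checked directly, since
the supremum of a Brownian bridge $B(r)$ has a well-known closed-form
distribution. The worked calculation below is offered only as a check on that
one value; it does not carry over to $\zeta < 4$, where the tabulated values
of Loretan and Phillips (1994) are needed.

```latex
P\Big(\sup_{0 \le r \le 1} B(r) > c\Big) = \exp(-2c^2)
\quad\Longrightarrow\quad
c_{0.05} = \sqrt{\tfrac{1}{2}\ln 20} \approx 1.224
```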


7.6 Modelling the central part of returns distributions 


We have concentrated attention so far on the tail shape of returns
distributions because tail observations are more important from both statistical
(e.g. for assessing normality and dispersion, and for regression and Monte 
Carlo analysis) and financial (e.g. for risk, probability of ruin and option 
pricing) viewpoints. Until recently, there had been little discussion of the 
shape of the central part of such distributions. This lack of attention is a little 
surprising, given that returns distributions are typically characterised by 
being highly ‘peaked’ as well as being too fat-tailed. 

To overcome some of the problems that accompany stable distributions, 
Mantegna and Stanley (1994, 1995) have introduced the truncated stable 
distribution, or truncated Lévy flight (TLF). TLFs have central parts of their 
density function that behave according to a stable distribution, while the tails 
decline exponentially, as in a normal distribution, rather than according to a 
power law. In this simple yet appealing way, TLFs maintain all the advantages 
of the stable distribution in the central part of the density while avoiding the 
problems of infinite variance and infinite higher moments. TLF processes 
have the very interesting property of converging to a normal distribution 
after some scaling interval. Mantegna and Stanley (1995) and Gavridis,
Markellos and Mills (1999) have found evidence of TLF processes in
high-frequency financial data, with the point at which the distribution
converges to a normal being around one month. This is consistent with
previous studies, which find that returns at intervals longer than one month
have distributions that are very close to normal.

The good mathematical properties of truncated Lévy flights allow the
application of standard finance theory and the development of analytical 
solutions (e.g. the option pricing models of Matacz, 2000). If, as suggested by 
the literature, we accept the point of convergence at around one month, this 
implies that investors with horizons of one month and longer face Gaussian 
risks and that conventional risk management and asset pricing are applicable.
On the other hand, investors at shorter horizons will face non-Gaussian 
fat-tailed distributions and must therefore use high-frequency data and non- 
Gaussian probability tools (e.g. fat-tail estimators and rare event analysis) to 
quantify their risks. 

Mantegna and Stanley (1995) employ a straightforward method to estimate
the characteristic exponent, $\alpha$, of the TLF, which is based on the idea
that the probability of a return to the origin scales as $\Delta t^{-1/\alpha}$.
More specifically, this methodology can be performed using the following
three steps.

(i) Calculate logarithmic returns at different sampling frequencies $\Delta t$. The
lowest frequency of data must be well below the point after which it is
suspected that the distribution of returns becomes normal. Since this
point is around a month, only data from tick to, say, intervals of a few
days should be considered.

(ii) Estimate the probability that returns for each of the intervals $\Delta t$ equal
the mean (origin), i.e. $P(X = \mu(\Delta t))$, where $\mu(\Delta t)$ is the mean value
of $X$ at interval $\Delta t$. In practice, $X$ ‘equals’ the mean if it is within a
range of values close to the origin, say within ±5 per cent.

(iii) Regress the logarithms of the probabilities $P(X = \mu(\Delta t))$ on the
logarithms of $\Delta t$; (minus) the inverse of the slope estimate from this
regression provides an estimate of the characteristic exponent $\alpha$.
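
The three steps above can be sketched in a few lines of Python. This is
only an illustration under stated assumptions, not the procedure used to
produce the reported estimates: the function name `tlf_exponent` is
hypothetical, intervals are measured in numbers of observations, returns
are computed on overlapping windows, and the ‘close to the origin’ band is
taken to be a fixed width equal to a fraction `band` of the standard
deviation of the shortest-interval returns.

```python
import numpy as np

def tlf_exponent(log_prices, intervals, band=0.05):
    """Estimate alpha from P(return to origin) scaling as dt**(-1/alpha)."""
    p = np.asarray(log_prices, dtype=float)
    d0 = min(intervals)
    # Assumption: one fixed band for every interval, set from the finest returns
    width = band * (p[d0:] - p[:-d0]).std()
    log_dt, log_p0 = [], []
    for dt in intervals:
        r = p[dt:] - p[:-dt]                          # step (i): log returns over dt
        p0 = np.mean(np.abs(r - r.mean()) <= width)   # step (ii): mass near the mean
        log_dt.append(np.log(dt))
        log_p0.append(np.log(p0))
    slope = np.polyfit(log_dt, log_p0, 1)[0]          # step (iii): log-log regression
    return -1.0 / slope, slope
```

The band must not be rescaled with the interval, since the method relies on
the probability mass inside a fixed neighbourhood of the origin shrinking as
the interval lengthens. For a Gaussian random walk the slope should be close
to $-0.5$, giving an exponent near two.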

The results from applying this regression procedure to our two returns 
series are shown in table 7.6, and both series have characteristic exponents 
that are below two. Of particular interest is the estimate of $\alpha$ for the
S&P 500. Mantegna and Stanley (1995) use transactions data on this index for
the six-year period from 1984 to 1989 and obtain an estimate of 1.40. With
daily data over sixty-four years, we obtain $\hat{\alpha} = 1.42$: a remarkable
confirmation of the invariance property of stable distributions.

Table 7.6 Estimates of characteristic exponents from the central part of
distributions

             Slope      $\hat{\alpha}$     $R^2$
FT30         -0.636     1.573              0.993
S&P 500      -0.703     1.423              0.980

Recently, the basic TLF model has been extended to develop a number of
very flexible stable processes for continuous-time modelling with applications
in option pricing. This research exploits the ability of stable distributions to
produce discontinuous variations or jumps, behaviour that is consistent with
the crashes and fat tails observed empirically in financial markets. Merton
(1976) was the first to propose a jump diffusion model, which augments
Brownian motion for returns with a compound Poisson process with jump
sizes that are normally distributed. In this manner, the distribution becomes
a mixture of normals weighted by Poisson probabilities. The Merton model
has since been extended in a variety of ways to allow for alternative jump
specifications and finite moments (for a review of this rapidly expanding
literature, see Wu, 2006).
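
The ‘mixture of normals weighted by Poisson probabilities’ can be made
concrete with a short sketch. The parameter names below (`mu` and `sigma`
for the diffusion drift and volatility per unit time, `lam` for the jump
intensity, `mu_j` and `sigma_j` for the mean and standard deviation of
the jump sizes) are illustrative assumptions, and the infinite Poisson sum is
simply truncated at `n_max` terms.

```python
import math

def merton_density(x, mu, sigma, lam, mu_j, sigma_j, dt=1.0, n_max=50):
    """Density of a return over dt as a Poisson-weighted mixture of normals."""
    total = 0.0
    for n in range(n_max + 1):
        # Probability of exactly n jumps in an interval of length dt
        weight = math.exp(-lam * dt) * (lam * dt) ** n / math.factorial(n)
        mean = mu * dt + n * mu_j               # n jumps shift the mean
        var = sigma ** 2 * dt + n * sigma_j ** 2  # and add to the variance
        total += weight * math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)
    return total
```

With `lam = 0` the function collapses to a single normal density, while a
small positive intensity with a negative `mu_j` produces the fat left tail
associated with occasional crashes.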


7.7 Data-analytic modelling of skewness and kurtosis 


So far in this chapter we have concentrated on the fat-tailed and highly 
peaked characteristics of return distributions and ignored, both theoretically 
and empirically, the possibility that the distributions may exhibit some 
degree of skewness. Skewness is important both because of its impact on 
portfolio choice and because kurtosis is not independent of skewness; the 
latter may ‘induce’ the former. 

Skewness measures for our series were reported in table 7.1; all are
negative and significantly different from zero on using the fact that
$\sqrt{T/6} \cdot \mathrm{skew} \sim N(0, 1)$. We investigate skewness further by
constructing plots using the order statistics introduced earlier. The median
can be defined as $X_{med} = X_{([T/2])}$. For a symmetric distribution, the order
statistics $X_{(p)}$, $X_{(T-p)}$, $p < [T/2]$, are equidistant from the median, i.e.

\[ X_{(T-p)} - X_{med} = X_{med} - X_{(p)} \]

so that a plot of the upper-order statistics $X_{(T-p)}$ against the lower
statistics $X_{(p)}$ should be linear with a slope of $-1$ if the distribution is
symmetric. Figure 7.4 shows these ‘upper-lower’ plots, which suggest that the
distributions are symmetric over a wide range of values, with asymmetry
appearing only in the tails of the distributions. Interestingly, the asymmetry
is characterised by negative skewness, so there is a greater probability of large 
falls in price than large increases. This is what we would expect from our 
knowledge of the episodic nature of market ‘crashes’, but is not what would 
be expected from three-moment portfolio analysis, in which investors should 
have a preference for positive skewness, for they should prefer portfolios with 
a larger probability of very large pay-offs. 
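
The coordinates of an ‘upper-lower’ plot, together with the standardised
skewness statistic, might be computed as follows. This is a sketch under
assumptions: the function names are hypothetical, and the $p$th smallest
observation is paired with the $p$th largest, which differs by one position
from the indexing used in the text but leaves the slope of $-1$ under
symmetry unchanged.

```python
import numpy as np

def upper_lower_pairs(x):
    """Return matched lower and upper order statistics for a symmetry plot."""
    xs = np.sort(np.asarray(x, dtype=float))
    half = xs.size // 2
    lower = xs[:half]                # smallest, second smallest, ...
    upper = xs[::-1][:half]          # largest, second largest, ...
    return lower, upper

def skewness_z(x):
    """sqrt(T/6) times the sample skewness, approximately N(0, 1) under normality."""
    x = np.asarray(x, dtype=float)
    d = x - x.mean()
    skew = np.mean(d ** 3) / np.mean(d ** 2) ** 1.5
    return np.sqrt(x.size / 6.0) * skew
```

Plotting `upper` against `lower` and overlaying a line of slope $-1$
through the median then shows where in the distribution any asymmetry
arises.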

Badrinath and Chatterjee (1988, 1991) and Mills (1995) analyse skewness
and kurtosis in returns distributions by fitting $g$, $h$ and $g \times h$
distributions (see Tukey, 1977, and Hoaglin, 1985). These distributions are
non-linear transformations of the normal. A $g$-distributed random variable
$Y_g$ is defined as

\[ Y_g = A + B g^{-1}(\exp(gZ) - 1) \]

where $Z \sim N(0,1)$, and is thus a shifted log-normal random variable bounded
by $-g^{-1}$. An $h$-distributed random variable is defined as

\[ Y_h = A + BZ\exp(hZ^2/2) \]

A positive $h$ will produce thicker tails than the normal. The $g \times h$
distribution is obtained by multiplying together the $g$ and $h$ distributions.
In these definitions $g$ and $h$ are assumed to be constant, but Hoaglin (1985)
allows them to be polynomial functions of $Z^2$ and also recommends that
different functions be allowed for the two tails of the distribution.

Figure 7.4 ‘Upper-lower’ symmetry plots (upper plotted against lower order
statistics, with a reference line of slope -1)
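
To see how the two parameters shape the distribution, the transformations
can be applied directly to standard normal values. The helper below is a
hypothetical sketch with constant $g$ and $h$; it combines the two
definitions multiplicatively and treats $g = 0$ as the limiting case in which
the $g$ factor reduces to $Z$.

```python
import numpy as np

def g_and_h(z, g=0.0, h=0.0, A=0.0, B=1.0):
    """Transform standard normal values z into g x h distributed values."""
    z = np.asarray(z, dtype=float)
    # g controls skewness; expm1(g z)/g tends to z as g tends to zero
    core = z if g == 0.0 else np.expm1(g * z) / g
    # h controls tail thickness through the factor exp(h z^2 / 2)
    return A + B * core * np.exp(h * z ** 2 / 2.0)
```

Setting `h=0` gives the $g$ distribution, setting `g=0` gives the $h$
distribution, and passing draws from a standard normal generator as `z`
produces a simulated sample.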

Details of how to fit these distributions may be found in the above
references, and Mills (1995), for example, finds that the post-1987 crash London
stock market indices are characterised by positive skewness and different 
levels of excess kurtosis in the two tails, the right tail being thicker than the 
left. Badrinath and Chatterjee (1988) also find that the right tail of the New 
York market returns distribution is thicker than the left, and both studies 
conclude that the central part of the distribution behaves differently from the 
tails, as we have found here using different techniques. 


7.8 Distributional properties of absolute returns 


Granger, Spear and Ding (2000) and Mills (1996a, 1997a) investigate the
distributional properties of absolute daily returns, the usefulness of such a
transformation for measuring risk being discussed in Granger and Ding
(1995). The use of absolute returns is suggested by the decomposition

\[ X_t = |X_t| \cdot \mathrm{sign}\, X_t \qquad (7.6) \]

where

\[ \mathrm{sign}\, X_t = \begin{cases} 1 & \text{if } X_t > 0 \\ 0 & \text{if } X_t = 0 \\ -1 & \text{if } X_t < 0 \end{cases} \]


Granger, Spear and Ding suggest three distributional properties related to
the decomposition (7.6): (i) $|X_t|$ and sign $X_t$ are independent, which will
be the case if the conditional distributions $|X_t| \mid (\mathrm{sign}\, X_t = 1)$ and
$|X_t| \mid (\mathrm{sign}\, X_t = -1)$ are the same; (ii) the mean and standard
deviation of $|X_t|$ are equal; and (iii) the marginal distribution of $|X_t|$ is
exponential after outlier reduction. This will be the case if both conditional
distributions are exponential. Note that an exponential distribution with
parameter $\varepsilon$ has both mean and standard deviation equal to $\varepsilon$, a
skewness of two and a kurtosis of nine.

Granger, Spear and Ding (2000) show that all three properties hold for the
S&P 500 series, which we confirm here in table 7.7. This reports conditional
means and standard deviations of the absolute returns of the FT30 and the
S&P 500. We use both the original observations and ‘outlier adjusted’ data:
this was produced by replacing any observation greater than four times the
standard deviation (S.D.) by the 4S.D. value having the same sign. We also
report the ratio of these statistics plus skewness and kurtosis measures. For
each series, these quantities are shown conditionally for $X_t > 0$ and $X_t < 0$
(denoted as + and -), together with the frequencies (given in the row
labelled ‘Probability’) of those signs occurring. The number of outliers
reduced in each subsample is also reported.
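
A sketch of how such a summary might be assembled is given below. The
function name `abs_return_summary` and the dictionary layout are
assumptions made for illustration; the outlier rule is taken to apply to
observations exceeding four standard deviations in absolute value, and
kurtosis is reported in raw (not excess) form so that the exponential
benchmark is nine.

```python
import numpy as np

def abs_return_summary(x, clip_sd=4.0):
    """Conditional moments of outlier-adjusted absolute returns, by sign."""
    x = np.asarray(x, dtype=float)
    limit = clip_sd * x.std()
    adjusted = np.clip(x, -limit, limit)     # outliers replaced by +/- 4 S.D.
    out = {}
    for label, mask in (("+", x > 0), ("-", x < 0)):
        a = np.abs(adjusted[mask])
        d = a - a.mean()
        sd = a.std()
        out[label] = {
            "probability": mask.mean(),      # frequency of this sign
            "mean": a.mean(),
            "sd": sd,
            "mean/sd": a.mean() / sd,        # close to one if exponential
            "skewness": np.mean(d ** 3) / sd ** 3,
            "kurtosis": np.mean(d ** 4) / sd ** 4,
            "outliers": int(np.sum(np.abs(x[mask]) > limit)),
        }
    return out
```

Calling the function with `clip_sd` set to a very large value reproduces
the ‘Observed’ columns, since no observation is then adjusted.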

Table 7.7 Properties of marginal return distributions

FT30
                 Observed              ‘Outlier adjusted’
                 +         -           +         -
Probability      0.50      0.44        n/a       n/a
Mean x 100       0.69      0.73        0.68      0.71
S.D. x 100       0.74      0.78        0.66      0.70
Mean/S.D.        0.93      0.93        1.02      1.01
Skewness         3.55      3.55        2.10      2.01
Kurtosis         26.87     28.76       8.77      8.13
Outliers         n/a       n/a         95        50

S&P 500
                 Observed              ‘Outlier adjusted’
                 +         -           +         -
Probability      0.52      0.46        n/a       n/a
Mean x 100       0.72      0.73        0.71      0.76
S.D. x 100       0.85      0.94        0.74      0.81
Mean/S.D.        0.86      0.82        0.96      0.93
Skewness         4.30      4.82        2.49      2.30
Kurtosis         37.97     59.13       11.13     9.45
Outliers         n/a       n/a         138       74

For both returns series there is evidence of asymmetry, in that
Prob($X_t > 0$) exceeds Prob($X_t < 0$), which obviously reflects their underlying
secular drift. The estimated conditional means and standard deviations of
the ‘outlier adjusted’ series are approximately equal, and the skewness and
kurtosis measures are close to two and nine, respectively. This suggests that
the marginal distributions of the outlier adjusted series are fairly well
approximated by an exponential distribution. While formal testing of this
hypothesis is inappropriate, as the series are not independent and identically
distributed, it is interesting to note that very similar findings have been
observed for the post-1987 crash data on the London FTSE 100 and Mid 250
indices (Mills, 1997a). For both series the first property, that $|X_t|$ and
sign $X_t$ are independent, is confirmed using a Kolmogorov–Smirnov test. It
would thus appear that the properties of absolute returns suggested by
Granger and Ding (1995) do indeed hold for this further selection of
speculative price series (but see Rydén, Teräsvirta and Åsbrink, 1998, for
further research in this area).

Granger, Spear and Ding (2000) argue that, if $|X_t|$ is exponential, then it is
reasonable to expect that the pair $|X_t|$, $|X_{t-k}|$ will be jointly exponential.
This joint distribution has the properties that the marginal distributions are
each exponential and that the conditional mean $E(|X_t| \mid |X_{t-k}|)$ is a linear
function of $|X_{t-k}|$. This suggests that linear regressions of absolute returns on lagged
absolute returns may have some predictive power, although the results 
presented by Granger, Spear and Ding (2000) and Mills (1996a, 1997a) show 
that any predictability is quite weak.


7.9 Summary and further extensions


We have surveyed a wide range of techniques for modelling the distribution 
of financial returns. The ‘stylised facts’ that come out of both our own 
empirical analysis and of the others surveyed here may be summarised thus. 
Returns are certainly not normally distributed but are characterised by fat 
tails and peakedness, both of which are unsurprising, and negative skewness, 
which is rather more surprising. It is only in the tails that skewness appears, 
however, with much of the distribution being symmetric. Although
symmetric, this central part of the distribution is not normal but tends to
approximate a stable distribution. The tails are not stable, however, but are 
exponentially declining, being consistent with a finite variance. While having 
a finite variance, returns do not generally appear to have a constant 
(unconditional) variance, or, indeed, covariances — covariance stationarity is 
rejected for all series. Absolute returns, when adjusted for outliers, 
approximate to an exponential distribution. 

These stylised facts suggest two broad areas of further research. The 
breakdown of covariance stationarity over all but the shortest of sample 
periods casts doubt on the validity and empirical accuracy of models that 
assume that the unconditional variance is constant — e.g. ARCH models. This 
suggests that extensions to time series models that explicitly incorporate 
error distributions that can effectively model outlier activity and time- 
varying unconditional variances would be very useful. 

The possibility that time series may have infinite variance has been taken into 
account for certain procedures discussed in earlier chapters. Phillips (1990) 
considers unit root tests under the assumption of infinite variance errors and 
shows that the non-parametric tests of chapter 3, section 1.4, continue to 
operate without any modification. Runde (1997) shows that the asymptotic
distribution of the Box–Pierce $Q^*$ statistic, introduced in example 2.1, is no
longer $\chi^2$ under an infinite variance assumption. Rather than using $Q^*$, he
proposes scaling it by the factor $T^{(2-\alpha)/\alpha}/(\log T)^{2/\alpha}$, and provides
simulated critical values for the new statistic. Scaling by this factor will
reduce the size of the statistic for typical situations, e.g. $T < 10,000$ and
$\alpha > 1.5$. Some advances
have been made in combining infinite variances with both short- and long- 
memory ARMA processes (Cioczek-Georges and Taqqu, 1995, and Kokoszka 
and Taqqu, 1994, 1996), but further research is clearly required. 
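
The effect of the rescaling is easily tabulated. The sketch below assumes
that the scaling factor takes the form $T^{(2-\alpha)/\alpha}/(\log T)^{2/\alpha}$;
the function name `runde_scaling` is hypothetical, and the simulated
critical values themselves are not reproduced here.

```python
import math

def runde_scaling(T, alpha):
    """Factor applied to the Q statistic under infinite-variance errors."""
    return T ** ((2.0 - alpha) / alpha) / math.log(T) ** (2.0 / alpha)
```

For example, with $T = 1,000$ and $\alpha = 1.5$ the factor is roughly 0.76,
so the scaled statistic is smaller than the original one.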

It is also important that theoretical models of rational economic
behaviour continue to be developed that can explain the outlier activity that
leads to these common and distinctive distributional properties of financial
returns. Some extensions of this type are discussed in McCulloch (1996), for 
example. Related to this, it is clear that evaluations of models should not rely 
on tests that are based on normal approximations. For example, trading rules 
should not be evaluated using tests that assume normal, stationary, and 
time-independent distributions. The use of bootstrap methodology (see 
Efron and Tibshirani, 1993) is one possibility, and two examples are Brock, 
Lakonishok and LeBaron (1992) and Mills (1997b). 

Another interesting approach that is receiving increasing attention is the
use of non-parametric or data-driven distributions (for a general
description, see Silverman, 1986) for approximating the empirical distribution
of financial returns. Non-parametric distribution estimation has made use of
powerful new technologies, such as bootstrap simulation (Efron and
Tibshirani, 1993) and neural networks (Modha and Fainman, 1994). Although
most approaches estimate non-parametric distributions on the basis of
independence, recent extensions relax this assumption (e.g. the
‘moving-blocks’ bootstrap). Applications of non-parametric distributions in finance
are growing rapidly, and most studies are concerned with problems in 
econometric analysis and estimation (see Horowitz, 2001). 

The overall conclusion from empirical studies of the unconditional
distribution of financial returns is that no single parametric model dominates
all situations (e.g. for exchange rates, see Boothe and Glassman, 1987; for
stock market data, see Kon, 1984). Theoretically, non-normality could be
due to a combination of reasons that include non-stationarity,
misspecification and pre-asymptotic behaviour. In finite samples, however,
unconditional distributions will always have fatter tails than conditional
distributions when the data have some type of conditional dependence,
especially if this is non-linear. It must be emphasised that uncorrelatedness
of returns is not sufficient to prove independence, especially in view of the
unconditional non-normality, since it is possible that non-linear
predictabilities exist. Although it may be the case that the normal distribution
and the underlying CLT assumptions do not hold for actual returns data and
that other distributions offer better fits, one must be wary of atheoretical
solutions of convenience. Moreover, if the deviations from normality are highly
irregular (e.g. due to outliers, structural breaks, regime switching, etc.) or 
cannot be isolated from the data, then the normal distribution may provide a 
relatively good approximation, especially from an out-of-sample perspective. 

In many cases the normal distribution will provide a good approximation
if returns are independent but not identically distributed. In general, even if
the underlying distribution is non-normal, the standard deviation can still be
used, provided that there are only small probabilities of extremely high and
low returns (see Levy and Markowitz, 1979). Even if second moments are
non-constant, these can be measured accurately if the conditional
distribution of returns is not too fat-tailed and volatility changes are smooth
(Nelson, 1990b). Covariances can be used not only for studying multivariate
normal distributions but also for the more general class of joint elliptic
distributions. Although statistical inference will be problematic, correlations
are also applicable when assessing relationships between stable distributions.


8 Regression techniques for 
non-integrated financial time series 


The analysis of the general linear regression model forms the basis of every 
standard econometrics text and we see no need to repeat such a development 
here. Models relating to financial time series, however, often cannot be 
analysed within the basic framework of ordinary least squares regression, or 
even its extensions incorporating generalised least squares or instrumental 
variables techniques. This chapter therefore develops a general theory of 
regression, based on the original work of Hansen (1982), White (1984) and 
White and Domowitz (1984), that builds upon the univariate time series 
techniques of the previous chapters and is applicable to many, but by no 
means all, of the regression problems that arise in the analysis of the
relationships between financial time series.

Section 8.1 thus sets out the basic dynamic linear regression model, while
section 8.2 incorporates ARCH error effects into the framework.
Misspecification testing is the topic of section 8.3, and section 8.4 discusses
robust estimation techniques and generalised method of moments (GMM)
estimation, which may be used when the standard assumptions of regression
are found to be invalid. The multivariate linear regression model is briefly
introduced in section 8.5. This paves the way for more general multivariate
regression techniques, and the remaining sections of the chapter deal with
vector autoregressions and their various extensions, including a discussion of
the concepts of exogeneity and causality.


8.1 Regression models 


8.1.1 Regression with non-integrated time series 


We now extend our modelling techniques to consider relationships between
a group of time series $\{z_t\}$. We begin by analysing the simplest case in which a
single variable $y_t$ is postulated to be a (linear) function of past values of itself
and present and past values of a vector of other variables
$x_t' = (x_{1t}, \ldots, x_{kt})$. Here $z_t = (y_t, x_t')'$ and, for the observed
realisation $\{z_t\}_1^T$, the model can be written as

\[ y_t = \alpha_0 + \sum_{i=1}^{m} \alpha_i y_{t-i} + \sum_{i=0}^{m} \beta_i x_{t-i} + u_t, \qquad m+1 \le t \le T \qquad (8.1) \]

or

\[ y_t = X_t\beta + u_t \]

In matrix form, we have

\[ y = X\beta + u \qquad (8.2) \]

where

\[ y = (y_{m+1}, \ldots, y_T)' \]

\[ X = (X_{m+1}', \ldots, X_T')' \]

\[ X_t = (1, y_{t-1}, \ldots, y_{t-m}, x_t', \ldots, x_{t-m}') \]

\[ u = (u_{m+1}, \ldots, u_T)' \]

\[ \beta = (\alpha_0, \alpha_1, \ldots, \alpha_m, \beta_0, \ldots, \beta_m)' \]

\[ \beta_i = (\beta_{i1}, \ldots, \beta_{ik}), \qquad i = 0, \ldots, m \]

To estimate the parameters of interest contained in the vector $\beta$, certain
assumptions are needed about $\{z_t\}_1^T$ and the error process $\{u_t\}_1^T$. We
begin by assuming that $\{z_t\}_1^T$ is a normally distributed (weakly) stationary
stochastic process. Noting that $z_t$ is of dimension $k+1$, extending the
stationarity requirements for a univariate series given in chapter 2, section 1,
to this multivariate setting yields

\[ E(z_t) = \mu = (\mu_y, \mu_1, \ldots, \mu_k)' \]

and

\[ Cov(z_t, z_s) = E(z_t - \mu)(z_s - \mu)' = \Gamma(|t-s|), \qquad 1 \le t, s \le T \]

so that the $z_t$s have identical means and variances and their temporal
covariances depend only on the absolute value of the time difference between them.

Note, however, that the assumption of stationarity alone is not sufficient
to obtain an operational model of the form (8.1). This is because the
non-zero covariances allow dependence between, for example, $z_1$ and $z_T$,
implying that the lag length $m$ in (8.1) should strictly be set at $t-1$, so that
the number of unknown parameters in the model increases with $T$. We thus
need to restrict the form of the dependence in $\{z_t\}_1^T$ and, to this end, the
following concepts are important (see White, 1984, and Spanos, 1986, for
detailed formal discussion): $z_t$ is said to be asymptotically independent if

\[ \Gamma(\tau) \to 0 \quad \text{as } \tau = |t-s| \to \infty \]

and ergodic if

\[ \lim_{T \to \infty} \left( \frac{1}{T} \sum_{\tau=1}^{T} \Gamma(\tau) \right) = 0 \]


It is conventional to make either the assumption of asymptotic
independence or the somewhat weaker assumption of ergodicity (cf. the
univariate development in chapter 2), and this allows us to restrict the memory
of the process $\{z_t\}_1^T$ and hence to fix the maximum lag at an appropriate
value ($m$, say) in (8.1). The error $\{u_t\}_1^T$ is defined formally as

\[ u_t = y_t - E(y_t \mid y_{t-1}^0, x_t^0) \]

where

\[ y_{t-1}^0 = (y_{t-1}, y_{t-2}, \ldots, y_1) \]

and

\[ x_t^0 = (x_t, x_{t-1}, \ldots, x_1) \]

We assume that it satisfies the following properties:

\[ E(u_t) = E(u_t \mid y_{t-1}^0, x_t^0) = 0 \]

\[ E(u_t u_s) = E\{E(u_t u_s \mid y_{t-1}^0, x_t^0)\} = \begin{cases} \sigma^2 & t = s \\ 0 & t > s \end{cases} \]

These two properties define $u_t$ to be a martingale difference relative to the
‘history’ $(y_{t-1}^0, x_t^0)$ and to have a finite variance, i.e. it is an innovation
process. Note that the assumption of asymptotic independence implies that
the roots of the polynomial $z^m - \sum_{i=1}^{m} \alpha_i z^{m-i} = 0$ are all less than
unity in absolute value.

Assuming $X$ to be of full rank $K = (m+1)(k+1)$, so that $X'X$ is
non-singular, and $u_t$ to be $NID(0, \sigma^2)$, the LS (and approximate ML)
estimator of $\beta$ obtained using the sample $\{z_t\}_1^T$ is

\[ \hat{\beta}_T = (X'X)^{-1}X'y \]

while the LS and approximate ML estimators of $\sigma^2$ are

\[ \hat{\sigma}_T^2 = (T-m)^{-1}\hat{u}'\hat{u} \]

and

\[ \tilde{\sigma}_T^2 = T^{-1}\hat{u}'\hat{u} \]

respectively, where $\hat{u} = y - X\hat{\beta}_T$ is the regression residual vector. (The
ML estimators are said to be approximate because the initial conditions
involving the observations $y_1, \ldots, y_m$ are ignored.) Since $u_t$ is not
independent of future $y_t$s, $E(X'u) \ne 0$, and so $\hat{\beta}_T$ is a biased estimator
of $\beta$:

\[ E(\hat{\beta}_T - \beta) = E((X'X)^{-1}X'u) \ne 0 \]


Nevertheless, assuming $G_T = E(X'X/T)$ to be uniformly positive definite
and since $E(X_t'u_t) = 0$, then, under certain conditions concerning the
magnitude of $E(X'X)$, $\hat{\beta}_T$ can be shown to be a strongly consistent
estimator of $\beta$, as indeed is $\hat{\sigma}_T^2$ of $\sigma^2$. The estimators are also
asymptotically normal:

\[ \sigma^{-1} G_T^{1/2} T^{1/2}(\hat{\beta}_T - \beta) \overset{a}{\sim} N(0, I) \]

\[ T^{1/2}(\hat{\sigma}_T^2 - \sigma^2) \overset{a}{\sim} N(0, 2\sigma^4) \]

(for formal derivations of these results, see, for example, White, 1984, and
Spanos, 1986). The asymptotic covariance matrix $\sigma^2 G_T^{-1}$ can be
consistently estimated in this case as $\hat{\sigma}_T^2 (X'X/T)^{-1}$, this being the
conventional formula in LS regression.

These results can be extended to allow both $z_t$ and $u_t$ to exhibit time
dependence and heterogeneity simultaneously. Specifically, the memory
requirement can be relaxed from that of stationarity and asymptotic
independence (or ergodicity) to one of strong mixing, as discussed in chapter 3,
section 2.3 (see conditions (3.10)). White (1984, exercise 3.51, theorem 4.25)
provides a formal statement of the required conditions and shows that, in
these circumstances, $\hat{\beta}_T$ is still consistent and asymptotically normal,
although we now have

\[ D_T^{-1/2} T^{1/2}(\hat{\beta}_T - \beta) \overset{a}{\sim} N(0, I) \]
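where

\[ D_T = (X'X/T)^{-1} \hat{V}_T (X'X/T)^{-1} \]

$\hat{V}_T$ is an estimate of $V_T = E(X'uu'X/T)$, which can be expressed in terms
of individual observations as

\[
\begin{aligned}
V_T &= V\Big(T^{-1/2}\sum_{t=1}^{T} X_t'u_t\Big) \\
    &= T^{-1}\sum_{t=1}^{T} E(X_t'u_tu_tX_t)
     + T^{-1}\sum_{\tau=1}^{T-1}\sum_{t=\tau+1}^{T}
       E(X_t'u_tu_{t-\tau}X_{t-\tau} + X_{t-\tau}'u_{t-\tau}u_tX_t) \\
    &= T^{-1}\sum_{t=1}^{T} V(X_t'u_t)
     + T^{-1}\sum_{\tau=1}^{T-1}\sum_{t=\tau+1}^{T}
       \big(Cov(X_t'u_t, X_{t-\tau}'u_{t-\tau}) + Cov(X_{t-\tau}'u_{t-\tau}, X_t'u_t)\big)
\end{aligned}
\]

thus revealing that $V_T$ is the average of the variances of $X_t'u_t$ plus a term
that takes account of the covariance between $X_t'u_t$ and $X_{t-\tau}'u_{t-\tau}$ for
all $t$ and $\tau$.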

With our mixing assumptions, the covariance between x} u, and X oie 
goes to zero as T — oo, and hence Vry can be approximated by 


T 
Vr=T'S > E(X] uu; X,) 


t=1 


n T 
+T Y SS E(X; mu, Xir + X; rtrt X,) (8.3) 


t=1 t=t+1 
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for some value n, because the neglected terms (those for which n < t < T) 
will be small in absolute value if n is sufficiently large. 

Note, however, that, if $n$ is simply kept fixed as $T$ grows, the number of
neglected terms grows, and may grow in such a way that the sum of these
terms does not remain negligible. The estimator $\hat{V}_T$, obtained by replacing $u_t$
by $\hat{u}_t$ in (8.3), will then be a consistent estimator of $\tilde{V}_T$ (and hence of $V_T$) if $n$
does not grow too rapidly as $T$ grows; specifically, we must ensure that $n$
grows more slowly than $T^{1/3}$. Unfortunately, although it is consistent, $\hat{V}_T$ need
not be positive semi-definite in small samples. For this reason, we may use
the Newey and West (1987) modified estimator, first introduced in chapter
3, section 2.3, and defined here as

$$\hat{V}_T = T^{-1}\sum_{t=1}^{T}\big(X_t'\hat{u}_t\hat{u}_t'X_t\big)
 + T^{-1}\sum_{\tau=1}^{n}\big(1 - \tau/(n+1)\big)\sum_{t=\tau+1}^{T}\big(X_t'\hat{u}_t\hat{u}_{t-\tau}'X_{t-\tau} + X_{t-\tau}'\hat{u}_{t-\tau}\hat{u}_t'X_t\big) \qquad (8.4)$$
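The weighted sum of autocovariances in the Newey and West estimator can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the simulated data, the function names `newey_west_V` and `robust_covariance`, and the truncation rule for $n$ are all assumptions introduced here, not taken from the text.

```python
import numpy as np

def newey_west_V(X, u_hat, n):
    """Newey-West estimate of V_T using Bartlett weights 1 - tau/(n+1)."""
    T = X.shape[0]
    g = X * u_hat[:, None]                  # row t holds X_t * u_t
    V = g.T @ g / T                         # tau = 0 term
    for tau in range(1, n + 1):
        weight = 1.0 - tau / (n + 1.0)
        gamma = g[tau:].T @ g[:-tau] / T    # sum over t of (X_t u_t)(X_{t-tau} u_{t-tau})'
        V += weight * (gamma + gamma.T)
    return V

def robust_covariance(X, u_hat, n):
    """D_T = (X'X/T)^{-1} V_T (X'X/T)^{-1}."""
    T = X.shape[0]
    Sxx_inv = np.linalg.inv(X.T @ X / T)
    return Sxx_inv @ newey_west_V(X, u_hat, n) @ Sxx_inv

# Simulated regression with AR(1) errors (illustrative data only)
rng = np.random.default_rng(0)
T = 400
X = np.column_stack([np.ones(T), rng.normal(size=T)])
e = rng.normal(size=T)
u = np.zeros(T)
for t in range(1, T):
    u[t] = 0.6 * u[t - 1] + e[t]
y = X @ np.array([1.0, 0.5]) + u

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
u_hat = y - X @ beta_hat
n = int(np.floor(T ** 0.25))                # hypothetical rule of thumb for the truncation lag
V_hat = newey_west_V(X, u_hat, n)
D_hat = robust_covariance(X, u_hat, n)
std_err = np.sqrt(np.diag(D_hat) / T)       # robust standard errors of beta_hat
```

Setting `n = 0` drops every autocovariance term and leaves only the heteroskedasticity-consistent part of the estimator.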


8.1.2 Hypothesis testing 


As is traditional, we consider hypotheses that can be expressed as linear 
combinations of the parameters in $\beta$:

$$R\beta = r$$


where R and r are a matrix and a vector of known elements, both of row 
dimension q, that specify the q hypotheses of interest. 

Several different approaches can be taken in computing a statistic to test 
the null hypothesis $R\beta = r$ against the alternative $R\beta \neq r$; we consider here
the use of Wald, Lagrange multiplier and (quasi-)likelihood ratio statistics.
Although the approaches to forming the test statistics differ, in each case an
underlying asymptotic normality property is exploited to obtain a statistic
that is asymptotically distributed as $\chi^2$. Detailed development of the theory
of hypothesis testing using these approaches may be found in Godfrey
(1988).

The Wald statistic allows the simplest analysis, although it may not be the
easiest to compute. Its motivation is the observation that, when the null
hypothesis is correct, $R\hat{\beta}_T$ should be close to $R\beta = r$, so that a value of
$R\hat{\beta}_T - r$ far from zero should be viewed as evidence against the null
hypothesis. To tell how far from zero $R\hat{\beta}_T - r$ must be before we reject the
null hypothesis, we need to determine its asymptotic distribution. White
(1984) shows that, if the rank of $R$ is $q \le K$, then the Wald statistic is

$$W_T = T(R\hat{\beta}_T - r)'\hat{\Omega}_T^{-1}(R\hat{\beta}_T - r) \overset{a}{\sim} \chi_q^2 \qquad (8.5)$$

where

$$\hat{\Omega}_T = RD_TR' = R(X'X/T)^{-1}\hat{V}_T(X'X/T)^{-1}R'$$


This version of the Wald statistic is useful regardless of the presence of 
heteroskedasticity or serial correlation in the error $u$ because a consistent
estimator $\hat{V}_T$ is used to construct $\hat{\Omega}_T$. In the special case where $u$ is white
noise, $V_T$ can be consistently estimated by $\hat{V}_T = \hat{\sigma}_T^2(X'X/T)$, and the Wald
statistic then has the form

$$W_T = T(R\hat{\beta}_T - r)'\left(R(X'X/T)^{-1}R'\right)^{-1}(R\hat{\beta}_T - r)/\hat{\sigma}_T^2$$

which is simply $q$ times the standard F-statistic for testing the hypothesis
$R\beta = r$. The validity of the asymptotic $\chi_q^2$ distribution for this statistic,
however, depends crucially on the consistency of the estimator $\hat{\sigma}_T^2(X'X/T)$
for $V_T$; if this $\hat{V}_T$ is not consistent for $V_T$, the asymptotic distribution of this
form for $W_T$ is not $\chi_q^2$, and hence failure to take account of serial correlation
and heterogeneity in the errors will lead to inferences being made using an 
incorrect distribution. 
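The link between the white-noise form of the Wald statistic and the F-statistic can be checked numerically. The sketch below uses simulated data and a hypothetical helper `wald_statistic`; the degrees-of-freedom divisor $T - K$ in the variance estimate is an assumption made so that the identity with $q$ times F holds exactly in finite samples.

```python
import numpy as np

def wald_statistic(beta_hat, R, r, D, T):
    """W_T = T (R b - r)' (R D R')^{-1} (R b - r)."""
    diff = R @ beta_hat - r
    Omega = R @ D @ R.T
    return float(T * diff @ np.linalg.solve(Omega, diff))

rng = np.random.default_rng(42)
T, K = 300, 3
X = np.column_stack([np.ones(T), rng.normal(size=(T, 2))])
y = X @ np.array([1.0, 0.0, 0.0]) + rng.normal(size=T)

# H0: both slope coefficients are zero (q = 2 restrictions)
R = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
r = np.zeros(2)
q = R.shape[0]

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
rss_u = float(np.sum((y - X @ beta_hat) ** 2))
sigma2_hat = rss_u / (T - K)                # divisor T - K: an assumption

# White-noise case: V_T estimated by sigma2_hat * (X'X/T),
# so that D_T collapses to sigma2_hat * (X'X/T)^{-1}
D_white = sigma2_hat * np.linalg.inv(X.T @ X / T)
W = wald_statistic(beta_hat, R, r, D_white, T)

# Standard F-statistic from restricted (constant only) and unrestricted fits
rss_r = float(np.sum((y - y.mean()) ** 2))
F = ((rss_r - rss_u) / q) / (rss_u / (T - K))
```

Passing a heteroskedasticity- and autocorrelation-consistent `D` to the same function gives the robust version of the statistic; only the covariance input changes.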

The Wald statistic is the most convenient test to use when the restrictions 
$R\beta = r$ are not easy to impose in estimating $\beta$. When these restrictions can be
imposed easily, the Lagrange multiplier statistic is more convenient to
compute. The motivation for the LM statistic is that a constrained LS
estimator can be obtained by solving the first-order condition of the Lagrangian
expression

$$L = (y - X\beta)'(y - X\beta)/T + (R\beta - r)'\lambda$$

The Lagrange multipliers $\lambda$ give the shadow price of the constraint, and
should therefore be small when the constraint is valid and large otherwise.
The LM test can thus be thought of as testing the hypothesis that $\lambda = 0$.
Solving the first-order conditions for $\lambda$ yields

$$\ddot{\lambda}_T = 2\left(R(X'X/T)^{-1}R'\right)^{-1}(R\hat{\beta}_T - r)$$

so that $\ddot{\lambda}_T$ is simply a non-singular transformation of $R\hat{\beta}_T - r$.
Also provided by solving the first-order conditions is the constrained LS
estimator $\ddot{\beta}_T$, given by


$$\ddot{\beta}_T = \hat{\beta}_T - (X'X/T)^{-1}R'\ddot{\lambda}_T/2$$

from which can be calculated the constrained estimator of $\sigma^2$,

$$\ddot{\sigma}_T^2 = (T-m)^{-1}\ddot{u}'\ddot{u}$$

where $\ddot{u} = y - X\ddot{\beta}_T$ are the residuals from the constrained regression. The
LM test statistic is then defined as

$$LM_T = T\ddot{\lambda}_T'\Lambda_T^{-1}\ddot{\lambda}_T \overset{a}{\sim} \chi_q^2 \qquad (8.6)$$

where

$$\Lambda_T = 4\left(R(X'X/T)^{-1}R'\right)^{-1}R(X'X/T)^{-1}\ddot{V}_T(X'X/T)^{-1}R'\left(R(X'X/T)^{-1}R'\right)^{-1}$$

$\ddot{V}_T$ being computed from the constrained regression. Note that the Wald and
LM statistics (8.5) and (8.6) would be identical if $\hat{V}_T$ were used in place of
$\ddot{V}_T$ and, indeed, the two statistics are asymptotically equivalent.
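The constrained estimator and the LM statistic follow directly from the closed-form multiplier. The sketch below is illustrative only: the data are simulated, the restriction is hypothetical, and the covariance matrix built from the constrained residuals omits the autocovariance terms for brevity (an assumption). It also confirms that the LM and Wald forms coincide when the same covariance estimate is used in both.

```python
import numpy as np

rng = np.random.default_rng(1)
T = 200
X = np.column_stack([np.ones(T), rng.normal(size=(T, 2))])
y = X @ np.array([0.5, 0.7, 0.3]) + rng.normal(size=T)

# Hypothetical restriction: the two slopes sum to one (q = 1)
R = np.array([[0.0, 1.0, 1.0]])
r = np.array([1.0])

Sxx_inv = np.linalg.inv(X.T @ X / T)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Lagrange multiplier and constrained LS estimator from the first-order conditions
A = R @ Sxx_inv @ R.T
diff = R @ beta_hat - r
lam = 2.0 * np.linalg.solve(A, diff)
beta_c = beta_hat - Sxx_inv @ R.T @ lam / 2.0
u_c = y - X @ beta_c

# Covariance estimate from constrained residuals (tau = 0 term only: an assumption)
g = X * u_c[:, None]
V_c = g.T @ g / T

A_inv = np.linalg.inv(A)
Lam = 4.0 * A_inv @ R @ Sxx_inv @ V_c @ Sxx_inv @ R.T @ A_inv
LM = float(T * lam @ np.linalg.solve(Lam, lam))

# Wald statistic computed with the same covariance estimate
Omega = R @ Sxx_inv @ V_c @ Sxx_inv @ R.T
W_same_V = float(T * diff @ np.linalg.solve(Omega, diff))
```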

As we have seen, when the errors $u_t$ are NID$(0, \sigma^2)$ the LS estimator $\hat{\beta}_T$ is
also the ML estimator. When this is not the case, $\hat{\beta}_T$ is said to be a QML
estimator. When $\hat{\beta}_T$ is the ML estimator, hypothesis tests can be based on the
log-likelihood ratio (here LR)

$$LR_T = \log\left(L(\ddot{\beta}_T, \ddot{\sigma}_T)/L(\hat{\beta}_T, \hat{\sigma}_T)\right)$$

where

$$L(\beta, \sigma) = \exp\left[-T\log\sqrt{2\pi} - T\log\sigma - \frac{1}{2}\sum_{t=m+1}^{T}(y_t - X_t\beta)^2/\sigma^2\right]$$

is the sample likelihood based on the normality assumption. Simple algebra
yields the following alternative form of the statistic,

$$LR_T = (T/2)\ln\left(\hat{\sigma}_T^2/\ddot{\sigma}_T^2\right)$$

and it can be shown that $-2LR_T$ is asymptotically equivalent to the Wald
statistic (8.5) and thus has a $\chi_q^2$ distribution asymptotically, provided that
$\hat{\sigma}_T^2(X'X/T)$ is a consistent estimator of $V_T$. If this is not true, then $-2LR_T$ is
not asymptotically $\chi_q^2$.
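The variance-ratio form of the LR statistic is easy to verify on a small example. The sketch below uses only the standard library, simulated data, and a single hypothetical restriction (a zero slope in a simple regression); both variance estimates use the divisor $T$, which is an assumption that leaves the ratio unchanged.

```python
import math
import random

random.seed(0)
T = 500
x = [random.gauss(0.0, 1.0) for _ in range(T)]
y = [1.0 + 0.1 * xi + random.gauss(0.0, 1.0) for xi in x]

xbar = sum(x) / T
ybar = sum(y) / T
sxx = sum((xi - xbar) ** 2 for xi in x)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
syy = sum((yi - ybar) ** 2 for yi in y)

rss_r = syy                      # restricted fit: constant only (slope = 0)
rss_u = syy - sxy ** 2 / sxx     # unrestricted fit: constant and slope

sig2_hat = rss_u / T             # unconstrained variance estimate
sig2_ddot = rss_r / T            # constrained variance estimate

LR = (T / 2.0) * math.log(sig2_hat / sig2_ddot)
minus2LR = -2.0 * LR

# White-noise Wald statistic for the same restriction (q = 1)
wald = T * (rss_r - rss_u) / rss_u
```

In this setting $-2LR_T = T\ln(1 + W_T/T)$, so the two statistics differ only by terms that vanish as $T$ grows.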


So far we have considered linear hypotheses of the form $R\beta = r$. In general,
non-linear hypotheses can be conveniently represented as

$$H_0: s(\beta) = 0$$

where $s$ is a continuously differentiable function of $\beta$. Just as with linear
restrictions, we can construct a Wald test based on the asymptotic distribution
of $s(\hat{\beta}_T)$, we can construct an LM test or we can form a log-likelihood ratio.
Assuming that the rank of $\nabla s(\beta) = q \le K$, where $\nabla s$ is the gradient
(derivative) of $s$, then under $H_0: s(\beta) = 0$, the Wald and LM test statistics are given
by equations (8.5) and (8.6) with $s(\hat{\beta}_T)$ and $\nabla s(\hat{\beta}_T)$ replacing $R\hat{\beta}_T - r$ and $R$,
respectively, in (8.5) and $s(\ddot{\beta}_T)$ and $\nabla s(\ddot{\beta}_T)$ similarly replacing these terms in
(8.6).


8.1.3 Instrumental variable estimation 


We have so far considered only (ordinary) LS estimation of the model (8.1). If 
the assumption $E(X_t'u_t) = 0$ does not hold, but a set of $l$ instrumental
variables (IVs), say $W_t = (w_{1t}, \ldots, w_{lt})$, are available such that $E(W_t'u_t) = 0$ and
$E(W'X/T)$ has uniformly full column rank, then we can form the IV estimator

$$\tilde{\beta}_T = (X'W\hat{P}_TW'X)^{-1}X'W\hat{P}_TW'y$$

where $W = (W_1, \ldots, W_T)$ and $\hat{P}_T$ is a symmetric $l \times l$ positive definite
norming matrix. For example, with $W = X$ and $\hat{P}_T = (W'W/T)^{-1}$,
$\tilde{\beta}_T = \hat{\beta}_T$, while, for any $W$, choosing $\hat{P}_T = (W'W/T)^{-1}$ yields the two-stage
least squares estimator. Analogous to the results for the LS estimator, if $W_t$ is
also mixing, then $\tilde{\beta}_T$ is strongly consistent and

$$D_T^{-1/2}\,T^{1/2}(\tilde{\beta}_T - \beta) \overset{a}{\sim} N(0, I)$$

where now

$$D_T = (X'W\hat{P}_TW'X/T^2)^{-1}(X'W/T)\hat{P}_T\hat{V}_T\hat{P}_T(W'X/T)(X'W\hat{P}_TW'X/T^2)^{-1}$$

So far we have let $\hat{P}_T$ be any positive definite matrix. By choosing $\hat{P}_T = \hat{V}_T^{-1}$,
however, an asymptotically efficient estimator is obtained for the class of IV
estimators with given instrumental variables $W$, i.e.

$$\beta_T^{*} = (X'W\hat{V}_T^{-1}W'X)^{-1}X'W\hat{V}_T^{-1}W'y$$

is asymptotically efficient within the class of IV estimators $\tilde{\beta}_T$.
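The two special cases of the norming matrix mentioned above can be confirmed numerically. This is a minimal sketch on simulated data with a deliberately endogenous regressor; the helper `iv_estimator` and all parameter values are assumptions made for illustration.

```python
import numpy as np

def iv_estimator(X, W, y, P):
    """beta = (X'W P W'X)^{-1} X'W P W'y for a symmetric positive definite P."""
    XW = X.T @ W
    return np.linalg.solve(XW @ P @ XW.T, XW @ P @ (W.T @ y))

rng = np.random.default_rng(7)
T = 500
w = rng.normal(size=(T, 2))                     # two simulated instruments
v = rng.normal(size=T)
u = 0.8 * v + rng.normal(size=T)                # error correlated with the regressor
x = w @ np.array([1.0, -0.5]) + v               # endogenous regressor
X = np.column_stack([np.ones(T), x])
W = np.column_stack([np.ones(T), w])            # the constant instruments itself
y = X @ np.array([1.0, 0.5]) + u

beta_ols = np.linalg.solve(X.T @ X, X.T @ y)

# W = X with P = (X'X/T)^{-1} reproduces LS
beta_iv_x = iv_estimator(X, X, y, np.linalg.inv(X.T @ X / T))

# P = (W'W/T)^{-1} gives two-stage least squares
beta_2sls = iv_estimator(X, W, y, np.linalg.inv(W.T @ W / T))

# Explicit two-stage computation for comparison
X_fit = W @ np.linalg.solve(W.T @ W, W.T @ X)   # first-stage fitted values
beta_two_stage = np.linalg.solve(X_fit.T @ X_fit, X_fit.T @ y)
```

Replacing `P` with the inverse of a consistent estimate of $V_T$ for the moments $W'u$ would give the efficient estimator within this class.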

How should we choose the set of instruments W? It can be shown that the 
asymptotic precision of the IV estimator cannot be worsened by including 
additional instruments. There are situations, however, when nothing is 
gained by adding an additional instrument: this is when the additional 
instrument is uncorrelated with the residuals of the regression of X on the 
already included instruments. 

When serial correlation or heteroskedasticity of unknown form is present 
in (8.1), there may, in fact, be no limit to the number of instrumental 
variables available for improving the efficiency of the IV estimator; functions 
of X and W are possible instruments. In the absence of serial correlation or 
heteroskedasticity, however, it is possible to specify precisely a finite set of
instruments that yield the greatest possible efficiency: they will be those
functions of $W_t$ that appear in the conditional expectation of $X_t$ given $W_t$.


8.1.4 Generalised methods of moments estimation 


Suppose we have a general, possibly non-linear, model that we can write as 
$u_t = f(y_t, X_t, \theta)$, where $\theta$ is an $s \times 1$ vector of parameters and $u_t$ can be both
serially correlated and heteroskedastic. Our model tells us only that there is
some true set of parameters $\theta_0$ for which $u_t$ is orthogonal to a set of
instruments $W_t$, so that

$$E(W_tu_t) = E(W_tf(y_t, X_t, \theta_0)) = E(m(y_t, X_t, W_t, \theta_0)) = 0$$

The estimation technique known as generalised methods of moments
(Hansen, 1982) focuses on these orthogonality conditions: see Hamilton
(1994, chap. 14) and Campbell, Lo and MacKinlay (1997, appendix) for
detailed treatments and Johnston and DiNardo (1997, chap. 10) for a
textbook discussion.

If we define a vector $m_T(\theta)$ containing the sample averages of the
elements of $m(\cdot)$,

$$m_T(\theta) = T^{-1}\sum_{t=1}^{T}m(y_t, X_t, W_t, \theta)$$

GMM minimises the quadratic form $m_T(\theta)'A_Tm_T(\theta)$, where $A_T$ is a
weighting matrix, leading to the first-order condition

$$M_T(\hat{\theta}_T)'A_Tm_T(\hat{\theta}_T) = 0$$

where

$$M_T(\theta) = \partial m_T(\theta)/\partial\theta$$

The asymptotic distribution of $\hat{\theta}_T$ is

$$D_{M,T}^{-1/2}\,T^{1/2}(\hat{\theta}_T - \theta_0) \overset{a}{\sim} N(0, I)$$

where

$$D_{M,T} = (M_T'A_TM_T)^{-1}M_T'A_T\hat{V}_{M,T}A_TM_T(M_T'A_TM_T)^{-1}$$

with $\hat{V}_{M,T}$ being defined analogously to $\hat{V}_T$ in (8.4). As in the IV case
discussed in section 8.1.3, an asymptotically efficient estimator $\hat{\theta}_T$ is obtained
by choosing the weighting matrix as $A_T = \hat{V}_{M,T}^{-1}$. When $f(y_t, X_t, \theta)$ is linear
then it is straightforward to show that the GMM estimator is the IV
estimator and, if $W = X$, it is the LS estimator.


Example 8.1 Forward exchange rates as optimal predictors of future spot rates. 
An important illustration of these estimation techniques is found in the 
analysis of foreign exchange markets, where the efficient markets hypothesis 
becomes the proposition that the expected rate of return to speculation in 
the forward market, conditioned on available information, is zero. Hansen 
and Hodrick (1980) test this ‘simple’ efficiency hypothesis in the following 
way. Let $s_t$ and $f_{t,k}$ be the logarithms of the spot exchange rate and the
$k$-period forward rate determined at time $t$, respectively. Since $s_{t+k} - f_{t,k}$ is an
approximate measure of the rate of return to speculation, the simple efficient
markets hypothesis is that

$$f_{t,k} = E(s_{t+k} \mid \Phi_t)$$

where $\Phi_t$ is the information set available at time $t$. This implies that the
speculative rate of return, $y_{t+k} = s_{t+k} - f_{t,k}$, should be uncorrelated with
information available at time $t$; for example, in the regression of the return
on a constant and two lagged returns

$$y_{t+k} = \alpha_0 + \alpha_1y_t + \alpha_2y_{t-1} + u_{t+k}$$

the $\alpha_i$s, $i = 0, 1, 2$, should all be zero. Assuming that $s_t$ and $f_{t,k}$, and hence $y_t$,
are mixing and that $E(y_{t-j}u_{t+k}) = 0$ for $j \ge 0$, which is easily verified, LS
estimation provides consistent estimates of the $\alpha_i$s.

In the present circumstances, however, the forecast error
$u_{t+k} = y_{t+k} - E(y_{t+k} \mid \Phi_t)$ will be serially correlated, so the usual estimated covariance
matrix will be inconsistent.

This serial correlation arises from the fact that the realised values of the 
spot exchange rate $s_{t+1}, s_{t+2}, \ldots, s_{t+k}$ are not known when the forward rate
$f_{t,k}$ is set at time $t$, so that the corresponding $k$-period ahead forecast errors
$u_{t+k-j} = s_{t+k-j} - f_{t-j,k}$, $j = 1, 2, \ldots, k-1$, are not observable. Since
$u_{t+1}, u_{t+2}, \ldots, u_{t+k-1}$ are not part of the available information set, we cannot rule
out the possibility that $E(u_{t+k} \mid u_{t+k-j}) \neq 0$, $1 \le j \le k-1$, or that

$$\mathrm{Cov}(u_{t+k}, u_{t+k-j}) \neq 0, \quad j = 1, 2, \ldots, k-1$$

On the other hand, the preceding $k$-period forecast errors $u_{t+k-j}$ for $j \ge k$
are observable. Efficiency thus requires $E(u_{t+k} \mid u_{t+k-j}) = 0$, $j \ge k$, and
hence

$$\mathrm{Cov}(u_{t+k}, u_{t+k-j}) = 0, \quad j \ge k$$

With our mixing assumptions concerning $s_t$ and $f_{t,k}$, $u_{t+k} = s_{t+k} - f_{t,k}$ will
also be mixing, and combining the above covariances shows that the forecast
errors can be thought of as being generated by an MA($k-1$) process.

Can we use generalised least squares procedures to make inferences about 
the $\alpha_i$s? The answer is ‘no’, because such techniques require the regressors to
be strictly exogenous, which means that $E(u_{t+k} \mid \ldots, y_{t-1}, y_t, y_{t+1}, \ldots) = 0$,
i.e. that future $y$ values would be useless in determining the optimal forecast
for $y_{t+k}$ (strict, and other forms of, exogeneity are formally discussed in
section 8.5). This is clearly inappropriate, as such values would provide
useful information for forecasting future rates of return. The use of
regressors that are not strictly exogenous renders GLS techniques inconsistent,
because the transformation used to eliminate the serial correlation in the 
residuals makes the transformed residuals for some particular period 
linear combinations of the original residuals and their lagged values. 
These, in turn, are likely to be correlated with the transformed data for 
the same period, since these include current values of the variables in the 
information set. 

One way of avoiding these difficulties is to choose the sampling interval to 
equal the forecast interval, i.e. to set k= 1, in which case the forecast errors 
will be serially uncorrelated. This procedure of using non-overlapping data 
clearly does not make use of all the available information: $T(1 - k^{-1})$
observations are sacrificed. In the present application weekly observations
are typically used with k set at thirteen (three-month forward exchange rates 
being readily available). Using non-overlapping data — i.e. sampling only 
every thirteen weeks — would thus throw away over 90 per cent of the 
available observations. 

The complete data set can be used if we adjust the covariance matrix of
$\hat{\beta} = (\hat{\alpha}_0, \hat{\alpha}_1, \hat{\alpha}_2)'$ in the appropriate fashion. As we have shown, a consistent
covariance matrix is

$$D_T = (X'X/T)^{-1}\hat{V}_T(X'X/T)^{-1}$$

where now the columns making up the $X$ matrix contain a constant and the
two lagged values of $y_{t+k}$. In this application we have available an explicit
expression for $\hat{V}_T$, namely $\hat{V}_T = T^{-1}X'\hat{\Theta}X$, where, from the fact that the
residuals $\hat{u}_{t+k}$ follow an MA($k-1$) process, the elements of the $T \times T$
symmetric matrix $\hat{\Theta}$ have the form

$$\hat{\Theta}_{i,i+j} = \hat{R}(j), \quad i = 1, 2, \ldots, T-k+1, \quad j = 0, 1, \ldots, k-1$$

$$\hat{\Theta}_{i+j,i} = \hat{\Theta}_{i,i+j}$$

where

$$\hat{R}(j) = T^{-1}\sum_{t=j+1}^{T}\hat{u}_{t+k}\hat{u}_{t+k-j}$$

and $\hat{\Theta}_{ij} = 0$ otherwise, i.e. $\hat{\Theta}$ is ‘band diagonal’, the bandwidth being
$2k-1$.

The hypothesis of market efficiency is $\beta = 0$ and, in the framework of
section 8.1.2, $R = I_3$, $r = 0$ and $\hat{\Omega}_T = D_T$. The Wald statistic, for example, for
testing this hypothesis takes the form

$$W_T = T\hat{\beta}_T'D_T^{-1}\hat{\beta}_T \overset{a}{\sim} \chi_3^2$$
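The band-diagonal covariance estimator and the resulting Wald statistic can be sketched as follows. The data are simulated overlapping $k$-period sums of white noise, so that the regression errors are MA($k-1$) under the null; the function name `hansen_hodrick_V`, the sample size and $k = 4$ are assumptions made for illustration. For simplicity the sketch fills every entry within the band rather than stopping at row $T-k+1$.

```python
import numpy as np

def hansen_hodrick_V(X, u_hat, k):
    """V_T = X' Theta X / T with Theta band diagonal (autocovariances up to lag k-1)."""
    T = X.shape[0]
    Theta = np.zeros((T, T))
    for j in range(k):
        R_j = float(u_hat[j:] @ u_hat[:T - j]) / T   # sample autocovariance at lag j
        idx = np.arange(T - j)
        Theta[idx, idx + j] = R_j
        Theta[idx + j, idx] = R_j
    return X.T @ Theta @ X / T, Theta

# Simulated overlapping k-period returns (illustrative only)
rng = np.random.default_rng(3)
k, N = 4, 260
e = rng.normal(size=N + k)
y = np.array([e[s:s + k].sum() for s in range(N)])

# Regress y_{t+k} on a constant, y_t and y_{t-1}
dep = y[k + 1:]
X = np.column_stack([np.ones(N - k - 1), y[1:N - k], y[:N - k - 1]])
T = dep.shape[0]

beta_hat = np.linalg.solve(X.T @ X, X.T @ dep)
u_hat = dep - X @ beta_hat

V_hat, Theta = hansen_hodrick_V(X, u_hat, k)
Sxx_inv = np.linalg.inv(X.T @ X / T)
D_hat = Sxx_inv @ V_hat @ Sxx_inv

# Wald statistic for beta = 0 (R = I_3, r = 0)
wald = float(T * beta_hat @ np.linalg.solve(D_hat, beta_hat))
```

Unlike the Newey and West form, this unweighted estimator is not guaranteed to be positive semi-definite in a given sample, which is worth checking before the statistic is interpreted.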


Hansen and Hodrick (1980) estimate regressions of this type for weekly data 
on spot and three-month (k= 13) forward exchange rates for seven cur- 
rencies (expressed in US cents per unit of foreign currency) from March 1973 
to January 1979, and for three currencies relative to the pound sterling for 
certain episodes after the First World War, in this case using one-month 
(k=4) forward rates. Their findings indicate that the simple efficiency 
hypothesis is ‘suspect’ in both periods, but they offer a variety of reasons why 
this may be so, emphasising that rejection of the hypothesis $\beta = 0$ cannot
necessarily be identified with inefficiency in the foreign exchange market, as 
certain intertemporal asset allocation and risk considerations are ignored in 
this formulation of the efficient markets hypothesis. 


8.2 ARCH-in-mean regression models 


8.2.1 The GARCH-M model 


The estimation techniques developed above are applicable when little is 
known about the structure of the serial correlation and heteroskedasticity 
present in the errors in model (8.1). On certain occasions, however, it may be 
possible to specify the form of these departures from white noise, and a 
specification that has proved to be particularly useful in financial applica- 
tions is the (G)ARCH-in-mean, or GARCH-M, model proposed by Engle, 
Lilien and Robins (1987), and employed initially by Domowitz and Hakkio
(1985) for examining risk premia in the foreign exchange market and by 
French, Schwert and Stambaugh (1987) to model stock return volatility. 

Bollerslev, Chou and Kroner (1992) provide many further references to 
early GARCH-M applications in finance, these often being attempts to 
model the linear relationship that emerges as a consequence of the inter- 
temporal CAPM of Merton (1973, 1980). Unfortunately, although most asset 
pricing models imply that conditional excess market returns should be 
proportional to the market conditional variance, empirical evidence using 
various formulations of the GARCH-M model has been far from conclusive 
(see, for example, the studies cited by Li et al., 2005).

The GARCH-M model extends the GARCH family developed in chapter 5,
section 5, to the regression framework of equation (8.1):

$$y_t = \alpha_0 + \sum_{i=1}^{m}\alpha_iy_{t-i} + \sum_{i=0}^{m}\beta_ix_{t-i} + \delta\sigma_t^{\lambda} + u_t \qquad (8.7)$$

$$u_t = \varepsilon_t - \sum_{i=1}^{n}\theta_i\varepsilon_{t-i} \qquad (8.8)$$

$$E(\varepsilon_t^2 \mid \Phi_{t-1}) = \sigma_t^2 = \gamma_0 + \sum_{i}\gamma_i\varepsilon_{t-i}^2 + \sum_{i}\phi_ih_{t-i} + \vartheta'\xi_t \qquad (8.9)$$

Here we allow the serially correlated errors $u_t$ to be modelled as an MA($n$)
process (equation (8.8)), and the conditional variance $\sigma_t^2$ (conditional upon
the information set at time $t-1$, $\Phi_{t-1}$) enters the ‘mean’ equation (8.7) and
depends itself (equation (8.9)) upon a vector of explanatory variables $\xi_t$.
Typically, $\lambda$ is set at one or two, so that either the conditional standard
deviation or variance is included in the mean equation. Under the assumption
that the $\varepsilon_t$ are NID$(0, \sigma^2)$, QML estimates of the GARCH-M model given by
equations (8.7) to (8.9) can be obtained by maximising the likelihood function 
using, for example, the BHHH algorithm analogous to that discussed in 
chapter 5, section 5.5. 
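The way the conditional variance feeds back into the mean equation is clearest in the likelihood recursion itself. The sketch below evaluates a Gaussian log-likelihood for a deliberately simplified case of (8.7) to (8.9): no lagged $y$ or $x$ terms, no MA component, and a GARCH(1,1) variance. The function name, the start-up value for the variance and all parameter values are assumptions made for illustration; in practice this function would be handed to a numerical optimiser.

```python
import math
import random

def garch_m_loglik(y, alpha0, delta, gamma0, gamma1, phi, lam=1.0):
    """Gaussian log-likelihood of y_t = alpha0 + delta*sigma_t**lam + eps_t with
    sigma_t^2 = gamma0 + gamma1*eps_{t-1}^2 + phi*sigma_{t-1}^2."""
    n = len(y)
    ybar = sum(y) / n
    sig2 = sum((v - ybar) ** 2 for v in y) / n       # start-up variance: an assumption
    eps_prev_sq = sig2
    loglik = 0.0
    for v in y:
        sig2 = gamma0 + gamma1 * eps_prev_sq + phi * sig2
        eps = v - alpha0 - delta * sig2 ** (lam / 2.0)   # sigma_t**lam enters the mean
        loglik += -0.5 * (math.log(2.0 * math.pi) + math.log(sig2) + eps * eps / sig2)
        eps_prev_sq = eps * eps
    return loglik

random.seed(11)
y = [0.05 + random.gauss(0.0, 1.0) for _ in range(300)]

ll_garch_m = garch_m_loglik(y, alpha0=0.0, delta=0.05, gamma0=0.1, gamma1=0.1, phi=0.8)

# With gamma1 = phi = delta = 0 the model collapses to i.i.d. N(alpha0, gamma0)
ll_iid = garch_m_loglik(y, alpha0=0.05, delta=0.0, gamma0=1.0, gamma1=0.0, phi=0.0)
ll_direct = sum(-0.5 * (math.log(2.0 * math.pi) + (v - 0.05) ** 2) for v in y)
```

Because each residual depends on the current variance, and each variance on the previous residual, the mean and variance parameters cannot be estimated in separate passes.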

There are some complications, however. For example, the information 
matrix is no longer block diagonal, so that all parameters must be estimated 
simultaneously, unlike the GARCH set-up, where the block diagonality of 
the information matrix allows estimates of the parameters of the mean and 
conditional variance equations to be obtained from separate iterations. 
Simulation evidence provided by Dotsis and Markellos (2007) on the finite 
sample properties of ML estimates of GARCH-M parameters suggests that 
biases are likely to occur even for sample sizes as large as 3000 observations. 
Moreover, the parameters that reflect the strength of association between 
returns and conditional variances are subject to the most severe biases. 

If it is preferred, the alternative assumption that the $\varepsilon_t$ follow a
standardised t-distribution may be employed to allow more adequate modelling of
the fat tails often found in the observed unconditional distributions of 
financial time series. Baillie and Bollerslev (1989), for example, provide the 
relevant expression for the log-likelihood function. 

The standard theory used to justify a relationship between returns and 
the conditional variance implies a linear functional form. It has been sug- 
gested, however, that this relationship could take any shape (see Backus 
and Gregory, 1993). Motivated by this, attempts have been made to derive
semi-parametric GARCH-M models, in which the conditional variance is
modelled using a GARCH process while the conditional mean is estimated 
non-parametrically using a flexible specification (see, for example, Linton 
and Perron, 2003). Li et al. (2005) derive a test for the existence of GARCH-M 
effects that allows for a flexible semi-parametric specification of the condi- 
tional variance process. 


Example 8.2 Stock returns and volatility 
Recalling the GARCH models fitted to the daily returns of the S&P 500 index 
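in example 5.6, we now fit a GARCH-M model of the form (with the return
series now denoted $y_t$)

$$y_t = \alpha_0 + \delta\sigma_t + u_t$$

$$u_t = \varepsilon_t - \theta\varepsilon_{t-1}$$

$$E(\varepsilon_t^2 \mid \Phi_{t-1}) = \sigma_t^2 = \gamma_0 + \gamma_1\varepsilon_{t-1}^2 + \phi_1h_{t-1} + \phi_2h_{t-2}$$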

i.e. the conditional standard deviation is included as a regressor in the mean 
equation of the previously fitted MA(1)-GARCH(1,2) model. QML esti- 
mation produces the following model, with robust t-statistics in parentheses: 


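$$y_t = \underset{(8.71)}{0.0729}\,\sigma_t + \varepsilon_t + \underset{(17.5)}{0.137}\,\varepsilon_{t-1}$$

$$\sigma_t^2 = \underset{(6.76)}{0.89 \times 10^{-6}} + \underset{(11.13)}{0.104}\,\varepsilon_{t-1}^2 + \underset{(5.42)}{0.598}\,\sigma_{t-1}^2 + \underset{(2.87)}{0.294}\,\sigma_{t-2}^2$$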


Ô = 5.89(24.04) 


The inclusion of $\sigma_t$ in the returns equation is an attempt to incorporate a
measure of risk into the returns-generating process and is an implication of
the ‘mean-variance hypothesis’ underlying many theoretical asset pricing
models, such as the intertemporal CAPM discussed above. Under this
hypothesis, $\delta$ should be positive, and this is found to be the case, so that large
values for the conditional variance are expected to be associated with large
returns. The MA(1) error may capture the effect of non-synchronous trading
and is highly significant. As before, the GARCH parameters sum to almost
unity, indicating IGARCH behaviour and high persistence in the conditional
variance. Similar models have been estimated by French, Schwert and
Stambaugh (1987) for daily excess returns, defined to be the market return 
minus the risk-free interest rate. 


Example 8.3 Conditional variance and the risk premium in the foreign exchange market
The evidence provided by Hansen and Hodrick (1980) for the rejection of
the ‘simple’ efficiency hypothesis in foreign exchange markets, which was
discussed in example 8.1, finds that rejection was often due to the intercept
$\alpha_0$ being non-zero. This finding could be regarded as evidence of a risk
premium, the presence of which would allow the forward rate to be a biased 
predictor of the future spot rate without sacrificing the notion of market 
efficiency. Of course, for this to be plausible, we must have an empirically 
tractable theory of a risk premium, for without such a theory there is no way 
of empirically distinguishing between an inefficient market and a, perhaps 
time-varying, risk premium. 

Although several theoretical models have been proposed that generate a risk 
premium in the foreign exchange market, it has been found to be extremely 
difficult to translate them into testable econometric models, and, conse- 
quently, their empirical performance provides only weak support for a time- 
varying risk premium. Domowitz and Hakkio (1985) therefore present a 
GARCH-M generalisation of the model used in example 8.1 to investigate the 
possible presence of a risk premium that depends on the conditional variance 
of the forecast errors. From example 8.1, the efficiency hypothesis states that 
the forward rate at time $t$, $f_{t,1}$, is an unbiased predictor of the future spot rate,
$s_{t+1}$, where, as before, logarithms are used, but where we now set the forecast
period at $k = 1$ for convenience. Thus,

$$s_{t+1} - f_{t,1} = u_{t+1}$$

where $u_{t+1}$ is the one-period forecast error, which should be zero-mean white
noise under the efficiency hypothesis.
This can equivalently be written as

$$\Delta s_{t+1} = (f_{t,1} - s_t) + u_{t+1}$$

which is then regarded as a restricted case of the GARCH-M model
of equations (8.7) to (8.9) with $y_t = \Delta s_t$ and $x_t = (f_{t-1,1} - s_{t-1})$. The
restrictions are $m = r = 0$, so that no lagged $y$'s or $x$'s appear in the equation
for $y_t$ and that the forecast error is serially uncorrelated, and $\beta_0 = 1$, so that
forecasts are unbiased. Maintaining $\beta_0 = 1$ and $u_t$ to be white noise, then
$\alpha_0 \neq 0$ and $\delta = 0$ implies a non-zero but constant risk premium, while $\alpha_0 \neq 0$
and $\delta \neq 0$ implies a time-varying risk premium.

The risk premium is given by $\alpha_0 + \delta\sigma_t^2$ (assuming $\lambda = 2$ for convenience),
and thus any change in it is due solely to changes in the conditional variance
$\sigma_t^2$; it can, nevertheless, be positive or negative and can switch signs,
depending on the values of $\alpha_0$ and $\delta$. For example, if $\alpha_0 \neq 0$ and $\delta \neq 0$, with
$\alpha_0 < 0$ and $\delta > 0$, then for small forecast errors the risk premium will be negative
(long positions in foreign currency require an expected loss), while for large forecast
errors the risk premium may turn positive (long positions in forward foreign currency
require an expected profit).

The model was fitted, with $\sigma_t^2$ assumed to follow an ARCH(4) process, to
non-overlapping monthly data from June 1973 to August 1982 for five
exchange rates vis-à-vis the US dollar: those of the United Kingdom, France,
Germany, Japan and Switzerland. The null hypothesis of no risk premium
($\alpha_0 = 0$, $\beta_0 = 1$ and $\delta = 0$) could be rejected for the United Kingdom and
Japan, but not for France, Germany or Switzerland, although for this last
currency it is only because the standard error of $\hat{\beta}_0$ is so large that the null
cannot be rejected, for the point estimate of $\beta_0$ is $-1.092$!


8.2.2 GARCH option pricing models 


Stochastic volatility and GARCH models have begun to be used in option 
pricing. Stochastic volatility has not been very popular in practice since the 
models are difficult to implement and test. GARCH models, as we have 
shown, have a well-defined theoretical framework and the significant 
advantage that options can be priced solely on the basis of historical spot 
asset returns, without necessarily resorting to option market data. Moreover, 
the diffusion limits of GARCH models encompass many of the well-known 
stochastic volatility models: see, for example, Nelson (1990b) and, for a 
review of this literature, Christoffersen and Jacobs (2004). In most of the 
approaches that have been proposed, once the GARCH model is estimated, 
option pricing involves tedious numerical approximations through simu- 
lation or series expansions. A practical approach has been developed by 
Heston and Nandi (2000), who propose a closed-form option pricing model 
where spot asset returns follow a GARCH-M process. 

For the GARCH(1,1)-M case, the conditional mean equation for the 
logarithmic returns is given by

$$y_t = r_f + \kappa_t - 0.5\sigma_t^2 + \sigma_t\varepsilon_t$$

where $r_f$ is the risk-free rate, $\kappa_t = (\lambda + 0.5)\sigma_t^2$ is a predictable risk premium
and $\varepsilon_t$ is a standard normal disturbance. The conditional variance of the
returns is governed by the following equation:

$$\sigma_t^2 = \gamma_0 + \gamma_1(\varepsilon_{t-1} - \delta\sigma_{t-1})^2 + \phi\sigma_{t-1}^2$$

The specification is very similar to that of the non-linear asymmetric
GARCH model of Engle and Ng (1993). The variance process remains
stationary with finite mean and variance as long as $\phi + \gamma_1\delta^2 < 1$. The kurtosis
of the returns distribution is determined by $\gamma_1$, and when this is zero we
obtain a deterministic time-varying variance. The parameter $\delta$ controls the
asymmetric effect of the disturbances, but the model is not fully consistent 
with the leverage effect as the quadratic specification of the GARCH process 
cannot generate a negative relationship between ‘good news’ and volatility 
(see Yu, 2005). By making appropriate transformations, Heston and Nandi 
(2000) use the characteristic function of the logarithm of the spot price to 
derive their option pricing formula. 
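The variance recursion and the stationarity condition can be explored by simulation. The sketch below uses only the standard library; the function name `simulate_hn` and every parameter value are hypothetical. The stationary mean of the variance, $(\gamma_0 + \gamma_1)/(1 - \phi - \gamma_1\delta^2)$, is not quoted in the text: it is obtained here by taking expectations of the recursion under the assumption that $\varepsilon_t$ is standard normal and independent of $\sigma_t$.

```python
import math
import random

def simulate_hn(n, r_f, lam, gamma0, gamma1, delta, phi, seed=0):
    """Simulate returns and conditional variances from the GARCH(1,1)-M recursion."""
    rng = random.Random(seed)
    # start at the stationary mean implied by taking expectations of the recursion
    sig2 = (gamma0 + gamma1) / (1.0 - phi - gamma1 * delta ** 2)
    returns, variances = [], []
    for _ in range(n):
        eps = rng.gauss(0.0, 1.0)
        sigma = math.sqrt(sig2)
        # r_f + kappa_t - 0.5*sigma_t^2 + sigma_t*eps_t with kappa_t = (lam + 0.5)*sigma_t^2
        returns.append(r_f + lam * sig2 + sigma * eps)
        variances.append(sig2)
        sig2 = gamma0 + gamma1 * (eps - delta * sigma) ** 2 + phi * sig2
    return returns, variances

# Hypothetical parameter values chosen to satisfy the stationarity condition
params = dict(r_f=0.0, lam=0.5, gamma0=0.02, gamma1=0.05, delta=1.0, phi=0.8)
persistence = params["phi"] + params["gamma1"] * params["delta"] ** 2

returns, variances = simulate_hn(100_000, seed=1, **params)
mean_var_theory = (params["gamma0"] + params["gamma1"]) / (1.0 - persistence)
mean_var_sim = sum(variances) / len(variances)

# With gamma1 = 0 the variance path is deterministic
_, flat = simulate_hn(50, 0.0, 0.5, 0.02, 0.0, 1.0, 0.8)
```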

Duan, Gauthier and Simonato (1999) employ Edgeworth expansions to 
derive an analytical approximation for European option pricing that assumes 
a similar process to that above for the conditional variance: 


$$\sigma_t^2 = \gamma_0 + \gamma_1\sigma_{t-1}^2(\varepsilon_{t-1} - \delta)^2 + \phi\sigma_{t-1}^2$$

In this case, the risk premium in the conditional mean equation is assumed to
be $\kappa_t = \lambda\sigma_t$. The non-negative parameter $\delta$ is likely to capture the negative
relationship between returns and volatility. The remaining parameters must
remain positive to ensure the positivity of the conditional variance. The
stationarity condition for the variance is $\gamma_1(1 + \delta^2) + \phi < 1$, while the
unconditional variance is given by $\gamma_0/(1 - \gamma_1(1 + \delta^2) - \phi)$. Note that, when the
conditional variance follows a simple GARCH(1, 1) process, it can be shown 
that this model converges at higher sampling frequencies to the CIR con- 
tinuous-time process, and the corresponding option pricing model to that of 
Heston (1993). 
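The stationarity condition and the unconditional variance for this specification both follow from taking expectations of the variance recursion. The short derivation below is a sketch under the assumption that $\varepsilon_{t-1}$ is standard normal and independent of $\sigma_{t-1}^2$; the symbol $\bar{\sigma}^2$ for the unconditional variance is introduced here for convenience.

```latex
\begin{aligned}
E[\sigma_t^2] &= \gamma_0 + \gamma_1 E[\sigma_{t-1}^2]\,E[(\varepsilon_{t-1}-\delta)^2] + \phi E[\sigma_{t-1}^2],
\qquad E[(\varepsilon_{t-1}-\delta)^2] = 1 + \delta^2 \\
&= \gamma_0 + \big(\gamma_1(1+\delta^2) + \phi\big)E[\sigma_{t-1}^2] \\
\Rightarrow\quad \bar{\sigma}^2 &= \frac{\gamma_0}{1 - \gamma_1(1+\delta^2) - \phi},
\qquad \text{provided } \gamma_1(1+\delta^2) + \phi < 1
\end{aligned}
```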

The use of GARCH parameter estimates in place of the true but unknown 
parameters in option pricing formulae gives rise to an ‘estimation risk’ 
problem, since the estimates of the parameters will also affect the estimate of 
the corresponding option price. An additional complication arises from the
fact that even an unbiased estimate of the variance will not necessarily produce
an unbiased estimate of the option price, since option pricing models are
highly non-linear with respect to the variance. Dotsis and Markellos (2007)
study the behaviour of the Heston and Nandi (2000) option pricing model
when the GARCH parameters are estimated via ML in finite samples. 
Although they find that the GARCH estimates contain significant biases even 
with samples of three years of daily data, the unconditional variance estimates 
are found to be relatively unbiased. In terms of option pricing, large over- 
pricing appears only for short-term, out-of-the-money option configurations, 
and Dotsis and Markellos show that jackknife resampling is an effective 
method for reducing bias. Reviews of the rapidly expanding literature on the 
econometric and empirical issues involved in option pricing are given by 
Garcia, Ghysels and Renault (2007) and Bates (2003). 


8.3 Misspecification testing 


The regression techniques developed in section 8.1 are based on the 
assumption that the model (8.1) is correctly specified, i.e. that the
assumptions underlying the model are valid. If they are not, then some of 
the techniques can be invalidated. It is important therefore to be able to test 
these assumptions: such tests are known as misspecification tests, and we 
begin their development by rewriting (8.1) as 


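$$
\begin{aligned}
y_t &= \alpha_0 + \beta_0x_t + \sum_{i=1}^{m}(\alpha_iy_{t-i} + \beta_ix_{t-i}) + u_t \\
&= \alpha_0 + \beta_0x_t + \sum_{i=1}^{m}\beta_i^{*}z_{t-i} + u_t
\end{aligned}
\qquad (8.10)
$$

where $\beta_i^{*} = (\alpha_i, \beta_i)$, so that $\beta = (\beta_0, \beta_1^{*}, \ldots, \beta_m^{*})$.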


8.3.1 Choosing the maximum lag, m 


The estimation theory developed in section 8.1 is based on the assumption that 
the maximum lag, m, is known. If this is so, then the assumption of mixing, 
which lets the errors $u_t$ exhibit both serial correlation and heterogeneity, still
allows the LS estimate $\hat{\beta}_T$ to be consistent and asymptotically normal, although
the associated covariance matrix is $D_T = (X'X/T)^{-1}\hat{V}_T(X'X/T)^{-1}$, where
the expression for $\hat{V}_T$ is given by equation (8.4). If $m$ is chosen to be larger than
its optimum but unknown value $m^{*}$, $\hat{\beta}_T$ will still be consistent and
asymptotically normal, but multicollinearity problems will often arise. This is because,
as $m$ increases, the same observed data $\{z_t\}_1^T$ are required to provide more and
more information about an increasing number of unknown parameters. 

If, on the other hand, $m$ is chosen to be ‘too small’, then the omitted lagged
$z_t$s will form part of the error term. If we assume that for the correct lag length
$m^{*}$, $u_t$ is a martingale difference, then the error term in the misspecified
model will no longer be non-systematic relative to $(y_{t-1}^0; x_t^0)$ and hence will
not be a martingale difference. This has the implication that $\hat{\beta}_T$ and $\hat{\sigma}_T$ are no
longer consistent or asymptotically normal, and, because of this, it is
important to be able to test for $m < m^{*}$. Given that the ‘true’ model is


Ve = Œo + Box, + So Bit i + uy 


i=1 


the error term in the misspecified model can be written as 


ae 
x K 
uU, = u + J B; Zt-i 


i=m+1 


This implies that m< m“ can be tested using the null hypothesis 
Ho : Bi.) = ++» = Pir = 0. The Wald statistic for testing this null against the 
alternative that at least one of the vectors ff, m+ 1 < i < m*, is non-zero is 
q=(m" — m)(k+ 1) times the standard F-statistic based on a comparison of 
the residual sums of squares from the regressions with the maximum lag 
length set at m and m“ respectively. The asymptotically equivalent LM statistic 
can be computed as T-R* from the auxiliary regression of # on 
X1,Zr—1,+++,Zt—-m*, Where the uf are the residuals from the estimation of 
(8.10). Both the Wald and LM tests will be asymptotically Ki 


The above analysis has assumed that, for the correct lag length $m^{*}$, $u_t$ is a 
martingale difference. One consequence of incorrectly setting $m$ to be less than 
$m^{*}$ is that the residuals from the regression (8.10) will be serially correlated. An 
alternative LM test is $T \cdot R^2$ from the regression of $\hat{u}_t^{*}$ on $x_t, z_{t-1}, \ldots, z_{t-m}$ and 
$\hat{u}^{*}_{t-1}, \ldots, \hat{u}^{*}_{t-m^{*}+m}$, which will be asymptotically $\chi^2_{m^{*}-m}$. This is strictly a test of 
residual serial correlation, and only an indirect test of lag length specification, 
but it points to the difficulty of distinguishing whether residual serial 
correlation is a consequence of an incorrect (too small) setting of the lag length $m$ 
or whether $m$ is correct but, nevertheless, the error term is serially correlated. 
As we have seen, in the former case $\hat{\beta}_T$ will be inconsistent, whereas in the latter 
it will be consistent and asymptotically normal. For detailed discussion of this 
important distinction, see Spanos (1986). 


8.3.2 Testing for normality, linearity and homoskedasticity 


Although the assumption that the errors in (8.10) are normally distributed is 
not a crucial one in the context of the asymptotic theory developed in section 8.1, its 
invalidity can have an important effect on LS estimates in finite samples; since 
chapter 7 has shown that many financial time series are observed to be 
non-normal, it is important to examine this normality assumption in regression 
applications. A popular test proposed by Jarque and Bera (1980) measures 
departures from normality in terms of the third and fourth moments (i.e. the 
skewness and kurtosis) of the residuals $\hat{u}_t$ from estimation of (8.10). Letting 
$\mu_3$ and $\mu_4$ be the third and fourth (central) moments of $u_t$, and defining 
$m_3 = \mu_3/\sigma^3$ and $m_4 = \mu_4/\sigma^4$ to be the moment measures of skewness and 
kurtosis, respectively, estimators of these measures are given by 

\[
\hat{m}_3 = \frac{T^{-1}\sum \hat{u}_t^3}{\hat{\sigma}_T^3}, \qquad
\hat{m}_4 = \frac{T^{-1}\sum \hat{u}_t^4}{\hat{\sigma}_T^4}
\]

The asymptotic distributions of these estimators under the null hypothesis 
of normality are 

\[
\sqrt{T}\,\hat{m}_3 \overset{a}{\sim} N(0, 6), \qquad
\sqrt{T}\,(\hat{m}_4 - 3) \overset{a}{\sim} N(0, 24)
\]

and, since they are also asymptotically independent, the squares of their 
standardised forms can be added to obtain 

\[
\frac{T}{6}\,\hat{m}_3^2 + \frac{T}{24}\,(\hat{m}_4 - 3)^2 \overset{a}{\sim} \chi^2_2
\]

so that large values of this statistic would flag significant departures from 
normality. 
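
As a rough illustration of the statistic just described, the following sketch computes the skewness, kurtosis and Jarque-Bera measures from a vector of residuals. The function name `jarque_bera` is our own, and we assume that the variance estimate divides by $T$ rather than by the degrees of freedom; the small input vector is purely illustrative.

```python
import numpy as np

def jarque_bera(resid):
    """Return (skewness, kurtosis, JB statistic) for a vector of residuals.

    Assumption: moments are computed with divisor T (no degrees-of-freedom
    correction), which is one common convention.
    """
    u = np.asarray(resid, dtype=float)
    T = u.size
    u = u - u.mean()                        # work with central moments
    sigma2 = np.mean(u ** 2)                # variance estimate, divisor T
    m3 = np.mean(u ** 3) / sigma2 ** 1.5    # moment measure of skewness
    m4 = np.mean(u ** 4) / sigma2 ** 2      # moment measure of kurtosis
    jb = T * (m3 ** 2 / 6.0 + (m4 - 3.0) ** 2 / 24.0)
    return m3, m4, jb

# Illustrative input only: a symmetric two-point sample.
m3, m4, jb = jarque_bera([-1.0, 1.0, -1.0, 1.0])
```

The statistic would then be compared with a $\chi^2_2$ critical value (5.99 at the 5 per cent level).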

The model (8.10) assumes that the conditional mean $E(y_t \mid y^0_{t-1}, x^0_t)$ is 
linear in $X_t$. To test this assumption we may consider the null hypothesis 

\[
H_0: \mu_{yt} = E(y_t \mid y^0_{t-1}, x^0_t) = X_t \beta
\]

which needs to be tested against the non-linear alternative 

\[
H_1: \mu_{yt} = h(X_t)
\]

If $h(\cdot)$ is assumed to take the form 

\[
h(X_t) = X_t \delta + c_2 \mu_{yt}^2 + c_3 \mu_{yt}^3 + \cdots + c_n \mu_{yt}^n
\]

then Ramsey’s (1969) RESET test for linearity is based on testing 
$H_0: c_2 = c_3 = \cdots = c_n = 0$ against $H_1: c_i \ne 0$, $i = 2, \ldots, n$. Its LM version is based on the 
auxiliary regression of $\hat{u}_t$ on $x_t, z_{t-1}, \ldots, z_{t-m}$ and $\hat{\mu}_{yt}^2, \ldots, \hat{\mu}_{yt}^n$, where 
$\hat{\mu}_{yt} = \hat{y}_t = X_t \hat{\beta}_T$, so that $T \cdot R^2$ is asymptotically distributed as $\chi^2$ with 
$n - 1$ degrees of freedom, one for each added power. If non-linearities are 
encountered then non-linear regression techniques will be 
required; these are developed in White and Domowitz (1984) and analysed in 
detail in Gallant and White (1988). 

To test for departures from homoskedasticity (assuming no serial 
correlation), we may consider constructing a test based on the difference 

\[
(X'\Omega X) - \sigma^2 (X'X)
\]

where $\Omega = \mathrm{diag}(\sigma^2_{m+1}, \sigma^2_{m+2}, \ldots, \sigma^2_T)$. This can be expressed in the form 

\[
\sum_{t=m+1}^{T} \left( E(u_t^2) - \sigma^2 \right) X_t X_t'
\]

and a test for heteroskedasticity could be based on the statistic 

\[
T^{-1} \sum_{t=m+1}^{T} \left( \hat{u}_t^2 - \hat{\sigma}_T^2 \right) X_t X_t'
\]

Given that this is symmetric, we can express the $\tfrac{1}{2}K(K-1)$, where again 
$K = (m+1)(k+1)$, different elements in the form 

\[
T^{-1} \sum_{t=m+1}^{T} \left( \hat{u}_t^2 - \hat{\sigma}_T^2 \right) \Psi_t
\tag{8.11}
\]

where 

\[
\Psi_t = (\psi_{1t}, \psi_{2t}, \ldots, \psi_{Jt})', \qquad \psi_{lt} = x_{it} x_{jt},
\]
\[
i \ge j, \quad i, j = 2, \ldots, k, \quad l = 1, 2, \ldots, J, \quad J = \tfrac{1}{2}K(K-1)
\]

the $x_{it}$ being columns of $X_t$. Although a test statistic can be based on (8.11), 
an asymptotically equivalent LM test (White, 1980) is the $T \cdot R^2$ statistic 
computed from the auxiliary regression of $\hat{u}_t^2$ on a constant and $\psi_{1t}, \ldots, \psi_{Jt}$, 
which is asymptotically distributed as $\chi^2_J$. Note, however, that the constant in 
the original regression (8.10) should not be involved in defining the $\psi$s in 
the auxiliary regression, since the inclusion of such regressors would lead to 
perfect multicollinearity. 
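
The auxiliary-regression form of this test is easy to sketch. The code below is a minimal illustration, not a definitive implementation: the function names `ols_resid` and `white_lm` are our own, the first column of `X` is assumed to be the constant, and the simulated data (with error variance proportional to $x_t^2$ in the heteroskedastic case) are purely hypothetical.

```python
import numpy as np

def ols_resid(y, X):
    """Residuals from an LS regression of y on the columns of X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta

def white_lm(y, X):
    """T*R^2 from regressing squared LS residuals on a constant and the
    products x_i*x_j (i >= j) of the non-constant columns of X.

    Assumption: column 0 of X is the constant and is excluded from the
    products, as the text requires. Returns (statistic, number of psi terms).
    """
    T, K = X.shape
    u2 = ols_resid(y, X) ** 2
    cols = [np.ones(T)]
    for i in range(1, K):
        for j in range(1, i + 1):
            cols.append(X[:, i] * X[:, j])
    Psi = np.column_stack(cols)
    e = ols_resid(u2, Psi)
    r2 = 1.0 - (e @ e) / np.sum((u2 - u2.mean()) ** 2)
    return T * r2, Psi.shape[1] - 1

# Hypothetical data: one regressor, homoskedastic versus heteroskedastic errors.
rng = np.random.default_rng(0)
x = rng.standard_normal(500)
e = rng.standard_normal(500)
X = np.column_stack([np.ones(500), x])
stat_hom, J = white_lm(1.0 + 0.8 * x + e, X)
stat_het, _ = white_lm(1.0 + 0.8 * x + x * e, X)
```

Each statistic would be referred to a $\chi^2_J$ distribution; in this design only the second should be large.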

This test, of course, does not propose any alternative form of 
heteroskedasticity. If such information is available (for example, that the errors 
follow an ARCH process) then tests specifically tailored to the alternative 
can be constructed. In the ARCH case the appropriate LM test is $T \cdot R^2$ from 
the regression of $\hat{u}_t^2$ on a constant and lags of $\hat{u}_t^2$ (cf. the testing of ARCH in 
chapter 5, section 5.6). 


8.3.3 Parameter stability 


Throughout this analysis we have assumed that the parameter vector $\beta$ is 
time-invariant. Evidence has accumulated that this may be a rather heroic 
assumption in many regression applications in finance: see, for example, 
the references and results in Coutts, Roberts and Mills (1997). Parameter 
instability may occur in many different forms, and testing for departures 
from parameter time invariance is not straightforward. One approach is 
to use recursive and rolling estimates of the parameters to assess stability. 
A recursive least squares procedure estimates the parameters over an 
increasing sequence of samples $m+1, \ldots, t$, $\tau + m + k + 1 < t \le T$, yielding 
the recursive estimates $\hat{\beta}^{(t)}$ for $t = \tau + m + k + 1, \ldots, T$, where $\tau$ is chosen 
to provide an adequate number of degrees of freedom when starting 
the recursion. Note that, by definition, $\hat{\beta}^{(T)} = \hat{\beta}_T$. The recursive residuals 
are defined as $v_t = u_{t|t-1} / \sqrt{f_t}$, where the prediction error $u_{t|t-1}$ is defined as 

\[
u_{t|t-1} = y_t - X_t \hat{\beta}^{(t-1)}
\]

and 

\[
f_t = 1 + X_t' \left( X_{(t-1)}' X_{(t-1)} \right)^{-1} X_t
\]

where $X_{(t)} = (X_{m+1}, \ldots, X_t)$. 

Subsample estimates may also be constructed; these may be denoted as 
$\hat{\beta}^{(t_1, t_2)}$ when the estimation period is from $t_1$ to $t_2$. When the estimation 
period is sequentially incremented by one observation, then sequences of 
rolling regressions with estimation window $t_2 - t_1 + 1$ are obtained. 

All these estimates may be used to examine whether the parameters of 
(8.10) are stable. Plots of the recursive and rolling regression coefficients are 
simple to construct and are often very informative, but there is also a range 
of formal test statistics available. For example, the cumulative sum of squares 
(CUSUMSQ) statistic, originally proposed by Brown, Durbin and Evans 
(1975) and defined as 

\[
S_t = \frac{\sum_{i=\tau_1}^{t} v_i^2}{\sum_{i=\tau_1}^{T} v_i^2}, \qquad \tau_1 = \tau + m + k + 2
\]

provides a simple test of parameter stability. If $S_t$ lies outside the range 
$t/(T-2) \pm c_0$, where $c_0$ depends on the chosen level of significance, then 
there is evidence of some form of parameter instability. Edgerton and Wells 
(1994) have provided a range of critical values for the statistic, as well as an 
algorithm for calculating probability values. Although Krämer and Ploberger 
(1990) highlight the poor power properties of the CUSUMSQ test against 
structural change, it does have good properties against heteroskedasticity. 
This is important here, because if the parameters of (8.10) are time-varying 
but are estimated as being constant, as is implied by LS, then the residuals 
will be heteroskedastic. Thus, a test for heteroskedasticity may also be 
interpreted as a test for parameter constancy. Similarly, parameter instability 
may also lead to serial correlation in the recursive residuals, so that 
portmanteau statistics may be calculated using the $v_t$. 
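
The recursive residuals and the CUSUMSQ path can be sketched as follows. This is only an illustration under our own conventions: the function names and the `start` argument (the number of observations used for the first fit) are hypothetical, indices are zero-based, and the simulated data have constant parameters.

```python
import numpy as np

def recursive_residuals(y, X, start):
    """Standardised one-step prediction errors v_t for t = start, ..., T-1.

    Each v_t uses LS estimates computed from observations 0, ..., t-1 only.
    Assumption: `start` is at least the number of regressors, so that the
    first fit is well defined.
    """
    T = len(y)
    v = np.empty(T - start)
    for t in range(start, T):
        Xp, yp = X[:t], y[:t]
        beta, *_ = np.linalg.lstsq(Xp, yp, rcond=None)
        x_t = X[t]
        f_t = 1.0 + x_t @ np.linalg.solve(Xp.T @ Xp, x_t)
        v[t - start] = (y[t] - x_t @ beta) / np.sqrt(f_t)
    return v

def cusumsq(v):
    """Cumulative sum of squared recursive residuals, scaled to end at one."""
    s = np.cumsum(v ** 2)
    return s / s[-1]

# Hypothetical data with stable parameters.
rng = np.random.default_rng(1)
x = rng.standard_normal(60)
X = np.column_stack([np.ones(60), x])
y = 0.5 + 1.5 * x + rng.standard_normal(60)
v = recursive_residuals(y, X, start=10)
S = cusumsq(v)
```

Plotting `S` against its expected straight-line path, with bands at plus and minus $c_0$, gives the usual graphical form of the test. A useful check on any implementation is that the squared recursive residuals sum to the increase in the residual sum of squares between the initial and the full sample.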

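Ploberger, Krämer and Kontrus (1989) consider a test based on recursive 
coefficients rather than on recursive residuals. Their fluctuation test is 
defined as 

\[
\max_t \; \frac{t}{\hat{\sigma}_T T} \left\| (X'X)^{1/2} \left( \hat{\beta}^{(t)} - \hat{\beta}_T \right) \right\|_\infty
\]

and critical values are provided in their table 8.1. 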

Following Dufour (1982), the recursive residuals can also be used to 
explore parameter instability within an auxiliary regression framework. For 
example, regressing $v_t$ on $x_t$ provides a general exploratory test, whereas 
regressing $v_t$ on sets of dummy variables defined to represent periods of 
possible instability provides more specific tests of parameter constancy. If 
specific break points are hypothesised, then versions of the traditional Chow 
(1960) test may be computed: for details, see, for example, Hendry and 
Doornik (2006). A test that may be used without selecting explicit break 
points is that proposed by Hansen (1992), which is discussed in Johnston 
and DiNardo (1997, chap. 4). We emphasise that the tests discussed here are 
by no means exhaustive, having been chosen primarily because of their 
popularity and ease of computation (which are certainly not independent 
choices, of course). Many other tests have been proposed over the years: Chu, 
Hornik and Kuan (1995), for example, have provided further tests and 
Perron (2006) gives a recent and detailed survey. 


Example 8.4 Testing the CAPM 

The CAPM is an important asset pricing theory in financial economics and 
has been the subject of considerable econometric research. An excellent 
exposition of the derivation of the model, which, as we have noted earlier, 
postulates a linear relationship between the expected risk and return of 
holding a portfolio of financial assets, can be found in Berndt (1991, chap. 2), 
who also considers many of the econometric issues involved in the empirical 
implementation of the model. 

The simple linear relationship between a small portfolio’s return, $r_p$, and 
its associated risk, measured by the standard deviation of returns, $\sigma_p$, can be 
written as 

\[
r_p - r_f = (\sigma_p / \sigma_m)(r_m - r_f)
\tag{8.12}
\]

where $r_m$ and $\sigma_m$ are the returns on the overall market portfolio and the 
standard deviation of such returns, respectively, and $r_f$ is the return on a 
risk-free asset. The term $r_p - r_f$ is thus the risk premium for portfolio $p$, while 
$r_m - r_f$ is the overall market’s risk premium. Denoting these risk premia as $y$ 
and $x$, respectively, letting $\beta = \sigma_p / \sigma_m$ and adding an intercept term $\alpha$ and a 
stochastic error term $u$, the latter reflecting the effects of specific 
(unsystematic) and diversifiable risk, the CAPM becomes the simple linear 
regression 

\[
y = \alpha + \beta x + u
\tag{8.13}
\]

The LS estimate of the slope coefficient $\beta$ is $\hat{\beta} = \mathrm{Cov}(x, y) / V(x)$, which is 
equivalent to $\sigma_{pm} / \sigma_m^2$, where $\sigma_{pm}$ is the covariance between portfolio $p$ and 
the market portfolio; this is known as the ‘investment beta’ for portfolio $p$, 
and measures the sensitivity of the return on the portfolio to variation in the 
returns on the market portfolio. Portfolios having $\beta$s in excess of unity are 
thus relatively risky, while those with $\beta$s less than unity are much less 
sensitive to market movements. 
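
The equivalence between the LS slope and the ratio of covariance to variance can be checked numerically. In the sketch below the function name `capm_ols` and the four risk premia are hypothetical, chosen so that the relationship holds exactly with $\alpha = 0.001$ and $\beta = 1.2$.

```python
import numpy as np

def capm_ols(y, x):
    """LS estimates (alpha, beta) of y = alpha + beta*x + u,
    with beta computed as Cov(x, y) / V(x)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    beta = np.cov(x, y, ddof=0)[0, 1] / np.var(x)
    alpha = y.mean() - beta * x.mean()
    return alpha, beta

# Hypothetical market and portfolio risk premia.
x = np.array([0.01, -0.02, 0.03, 0.00])
y = 0.001 + 1.2 * x
alpha_hat, beta_hat = capm_ols(y, x)
```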

LS estimation of the CAPM regression from observed time series $\{y_t, x_t\}_1^T$ 
is, of course, trivial. In this time series context, however, the underlying 
CAPM theory requires certain assumptions to hold. Specifically, we must 
assume that the risk premia are stationary, normally distributed and serially 
uncorrelated, in which case the error process $\{u_t\}_1^T$ will be normally and 
independently distributed (NID). Note also that the intercept $\alpha$ has been 
included without any justification, for it does not appear in the original 
CAPM expression (8.12). The CAPM theory thus provides the testable 
hypothesis $\alpha = 0$, along with the following implications: the residuals of the 
regression (8.13) should be serially uncorrelated, homoskedastic and normal, 
the systematic relationship between $y$ and $x$ should be linear, and the 
estimate of $\beta$ should be time-invariant. 

The empirical performance of the CAPM was investigated using the data 
set provided by Berndt (1991, chap. 2), which contains monthly returns from 
January 1978 to December 1987 on seventeen US companies plus a monthly 
risk-free return. Treating each company’s risk premia, calculated as the 
difference between the company return and the risk-free return, as a separate 
portfolio enabled seventeen CAPM regressions of the form (8.13) to be 
estimated, and these are reported in table 8.1. 

Only three of the estimated regressions survive the battery of mis- 
specification tests unscathed: those for CONED, DELTA and MOTOR (see 
Berndt, 1991, for the actual companies associated with these variable names). 
Little evidence of serial correlation or non-linearity is found in the residuals 
but, rather, more evidence of heteroskedasticity, non-normality and par- 
ameter non-constancy is encountered. Standard errors calculated using (8.3) 
have a tendency to be larger than their OLS counterparts for betas, but 
smaller for intercepts, although the differences are usually quite small. Those 
regressions that exhibited significant ARCH were estimated with GARCH 
errors, but little change was found in the coefficients of the mean equation. 
GARCH-M extensions were found to be unnecessary in all cases. 


Example 8.5 Further modelling of the FTA All Share index 
In example 2.6 we fitted an ARMA(2,2) process to the logarithmic 
changes of the FTA All Share index, which we now denote as $\Delta p_t$. Mills 
(1991a) finds evidence that $\Delta p_t$ is related to the logarithmic changes in long 
interest rates and dividends, and we therefore investigate the extended 
regression model 

\[
\Delta p_t = \alpha_0 + \sum_{i=1}^{3} \alpha_i \Delta p_{t-i} + \sum_{i=0}^{3} \beta_{1i} \Delta r20_{t-i} + \sum_{i=0}^{3} \beta_{2i} \Delta d_{t-i} + u_t
\tag{8.14}
\]

Here $r20_t$ and $d_t$ are the logarithms of twenty-year gilts and the dividend 
index, respectively, so that $k = 2$, and the lag length is set at $m = 3$, although 
this could be selected using an information criterion, by an obvious 
extension to the discussion in example 2.3. Unit root tests confirm that both series 
are I(1), hence their appearance in first-differenced form. 

Table 8.2 Estimates of the FTA All Share index regression (8.14) 

1          $\Delta p_{-1}$  $\Delta p_{-2}$  $\Delta p_{-3}$  $\Delta r20$  $\Delta r20_{-1}$  $\Delta r20_{-2}$  $\Delta r20_{-3}$ 
-0.0035    0.515     0.114     0.072     -0.409    0.180     0.005     0.009 
(0.0013)   (0.048)   (0.053)   (0.047)   (0.076)   (0.038)   (0.039)   (0.039) 
[0.0016]   [0.067]   [0.075]   [0.051]   [0.106]   [0.046]   [0.042]   [0.039] 

$\Delta d$  $\Delta d_{-1}$  $\Delta d_{-2}$  $\Delta d_{-3}$  $R^2$  $\hat{\sigma}$  $W_{487}$ 
0.820      0.172     0.070     -0.012    0.787     0.0273    3.55 
(0.021)    (0.043)   (0.043)   (0.022) 
[0.043]    [0.058]   [0.052]   [0.019] 

Notes: 
(...) = conventional standard error; [...] = Newey-West standard error. 
$W_{487}$ = Wald statistic (8.5) computed using $T = 487$ observations; there are $q = 6$ 
restrictions, and hence it is asymptotically distributed as $\chi^2_6$, the 5 per cent 
critical value being 12.59. 

Estimates of this model are presented in table 8.2, where it is seen that 
many of the coefficients are insignificant, particularly when measured against 
the Newey-West (1987) standard errors, computed using (8.4) with $n = 5$. 
The following set of hypotheses was therefore tested: 

\[
\begin{aligned}
\alpha_3 &= 0 \\
\beta_{11} = \beta_{12} = \beta_{13} &= 0 \\
\beta_{22} = \beta_{23} &= 0
\end{aligned}
\]

The Wald statistic reported in table 8.2 shows that this joint hypothesis 
cannot be rejected, the associated marginal significance level being 0.74, and 
estimation of the restricted equation yields 

\[
\begin{aligned}
\Delta p_t = {} & \underset{[0.0014]}{-0.0035} + \underset{[0.063]}{0.534}\,\Delta p_{t-1} + \underset{[0.055]}{0.166}\,\Delta p_{t-2} \\
& - \underset{[0.044]}{0.181}\,\Delta r20_t + \underset{[0.042]}{0.817}\,\Delta d_t + \underset{[0.054]}{0.156}\,\Delta d_{t-1}
\end{aligned}
\]
\[
R^2 = 0.786, \qquad \hat{\sigma} = 0.0271
\]


The current change in the gilt yield enters negatively, reflecting the 
well-known trade-off between the equity and gilt markets in the United Kingdom, 
while the current and lagged changes in the dividend yield enter positively. 

The additional regressors reduce the residual standard error over the 
univariate model (see example 2.6, but note the rescaling of the standard 
error), but, as both contain contemporaneous terms, they are of little use in 
forecasting and, of course, beg the question of whether they can be regarded 
as exogenous — a question we return to later. 


8.4 Robust estimation 


As we have seen from the above examples, and from the variety of results 
presented in, for example, Coutts, Mills and Roberts (1994) and Mills and 
Coutts (1996), the non-normality of residuals may be a common occurrence, 
being typically caused by the presence of some abnormally large outliers. 
Non-normality, per se, may not have important consequences theoretically, 
since, although LS estimators are no longer asymptotically efficient, they 
nevertheless remain unbiased and consistent, and standard hypothesis tests 
are still asymptotically $\chi^2$. The power of such tests can be extremely sensitive 
to departures from normality and can lack robustness, however, in the sense 
that the finite sample distribution can be altered dramatically when the 
distribution of the error is altered only slightly (see Koenker, 1982). 

Moreover, if the error variance is infinite, LS estimators lose their min- 
imum variance property, and, since it is then impossible to obtain a 
meaningful estimate of the variance, conventional hypothesis tests can be 
very misleading. The strong likelihood of non-normal, and possibly infinite 
variance, errors has therefore led to the development of alternative estima- 
tion procedures that, relative to LS, place less weight on outliers, and these 
are generally known as robust estimators. 

A wide variety of robust estimators have been proposed, and we will 
concentrate here on methods based on regression quantiles: for financial 
applications, see, for example, Tomczyk and Chatterjee (1984), Chan and 
Lakonishok (1992) and Mills and Coutts (1996), and, for a general textbook 
treatment, see Rousseeuw and Leroy (2003). The regression quantile family 
of estimators is based on minimising the criterion function 

\[
\sum_{t} \rho_\theta(u_t)
\]

where, for $0 < \theta < 1$, 

\[
\rho_\theta(u_t) =
\begin{cases}
\theta |u_t| & \text{if } u_t \ge 0 \\
(1 - \theta) |u_t| & \text{if } u_t < 0
\end{cases}
\]

Since $\rho_\theta(u_t)$ is a weighted sum of the absolute values of the residuals, outliers are 
given less importance than under a squared residual criterion. When $\theta = 0.5$ 
the least absolute errors (LAE) estimator is obtained, whereas, more generally, 
large (small) values of $\theta$ attach a heavy penalty to observations with large 
positive (negative) residuals. For example, for a given value of $\theta$, a bivariate 
regression line passes through at least two observations, with at most $T\theta$ 
observations lying below the line and at least $(T - 2)\theta$ observations lying above it. 

Varying $\theta$ between zero and one yields a set of ‘regression quantile’ 
estimators $\hat{\beta}(\theta)$: for example, the LAE estimator is $\hat{\beta}(0.5)$. The effect of large 
outlying observations will tend to be concentrated in the regression quantiles 
corresponding to extreme values of $\theta$, while the behaviour of the sample 
observations will determine how the regression quantiles change as $\theta$ varies. 
Consequently, a variety of estimators have been proposed that combine 
several regression quantiles, for example the trimean (TRM): 

\[
\hat{\beta}_{TRM} = 0.25\,\hat{\beta}(0.25) + 0.5\,\hat{\beta}(0.5) + 0.25\,\hat{\beta}(0.75)
\]
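
For a bivariate regression, the property that a regression quantile line passes through at least two observations suggests a simple, if inefficient, way of computing it: search over all lines through pairs of observations and keep the one with the smallest criterion value. The sketch below does this for small samples; the function names and the data (points on the line $y = 1 + 2x$ with a single outlier) are our own illustrative assumptions, and practical work would use a linear programming routine instead.

```python
import numpy as np
from itertools import combinations

def check_loss(u, theta):
    """Criterion sum of rho_theta(u_t): theta*|u| if u >= 0, (1-theta)*|u| otherwise."""
    u = np.asarray(u, dtype=float)
    return np.sum(np.where(u >= 0, theta * np.abs(u), (1.0 - theta) * np.abs(u)))

def regression_quantile(y, x, theta):
    """Bivariate regression quantile (intercept, slope) by exhaustive search
    over lines through pairs of observations. Suitable for small T only."""
    best, best_loss = None, np.inf
    for i, j in combinations(range(len(y)), 2):
        if x[i] == x[j]:
            continue                      # vertical line, skip
        b = (y[j] - y[i]) / (x[j] - x[i])
        a = y[i] - b * x[i]
        loss = check_loss(y - a - b * x, theta)
        if loss < best_loss:
            best, best_loss = np.array([a, b]), loss
    return best

def trimean(y, x):
    """TRM estimator: weighted average of the 0.25, 0.5 and 0.75 quantiles."""
    return (0.25 * regression_quantile(y, x, 0.25)
            + 0.5 * regression_quantile(y, x, 0.5)
            + 0.25 * regression_quantile(y, x, 0.75))

# Hypothetical data: an exact line with one large outlier.
x = np.arange(10.0)
y = 1.0 + 2.0 * x
y[4] = 100.0
lae = regression_quantile(y, x, 0.5)
trm = trimean(y, x)
ols = np.linalg.lstsq(np.column_stack([np.ones(10), x]), y, rcond=None)[0]
```

In this constructed sample the LAE and TRM estimates ignore the outlier and recover the underlying line, whereas the OLS intercept is pulled well away from it.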


The regression quantiles can also be combined in the form of a trimmed 
regression quantile (TRQ) estimator, 

\[
\hat{\beta}_{TRQ}(\phi) = (X_\phi' X_\phi)^{-1} X_\phi' y_\phi
\]

where $0 < \phi < 0.5$ and $X_\phi$ and $y_\phi$ contain the observations retained after 
trimming. This estimator is obtained by computing $\hat{\beta}(\phi)$ and 
$\hat{\beta}(1 - \phi)$, excluding all observations lying on or below the $\phi$th regression 
quantile line and all those lying above the $(1 - \phi)$th quantile line, and 
applying OLS to the remaining observations. It can thus be interpreted as a 
‘trimmed least squares’ estimator (Ruppert and Carroll, 1980). All these 
estimators can be shown to produce asymptotically normal estimators of $\beta$, 
with appropriate covariance matrices given in, for example, Judge et al. 
(1985, chap. 20), where a detailed treatment of robust estimators in 
econometrics in general can be found. 

Table 8.3 Robust estimates of the CAPM regression 

                  CONTIL                            DEC 
                  $\alpha$         $\beta$          $\alpha$         $\beta$ 
OLS               -0.013 (0.013)   0.73 (0.19)      0.007 (0.007)    0.85 (0.11) 
LAE               -0.013 (0.008)   0.67 (0.11)      0.007 (0.009)    0.74 (0.13) 
TRM               -0.017 (0.004)   0.66 (0.05)      0.005 (0.004)    0.77 (0.06) 
TRQ($\phi$=0.1)   -0.018 (0.007)   0.62 (0.11)      0.005 (0.007)    0.71 (0.10) 
TRQ($\phi$=0.2)   -0.017 (0.008)   0.63 (0.11)      0.004 (0.008)    0.78 (0.11) 

                  GERBER                            MOBIL 
                  $\alpha$         $\beta$          $\alpha$         $\beta$ 
OLS               0.005 (0.007)    0.63 (0.10)      0.004 (0.006)    0.72 (0.09) 
LAE               -0.008 (0.009)   0.57 (0.14)      0.004 (0.007)    0.59 (0.10) 
TRM               -0.001 (0.004)   0.57 (0.06)      0.003 (0.003)    0.63 (0.04) 
TRQ($\phi$=0.1)   -0.001 (0.007)   0.58 (0.10)      0.002 (0.006)    0.64 (0.08) 
TRQ($\phi$=0.2)   -0.002 (0.007)   0.58 (0.10)      0.002 (0.006)    0.60 (0.09) 

                  PANAM                             PSNH 
                  $\alpha$         $\beta$          $\alpha$         $\beta$ 
OLS               -0.009 (0.011)   0.74 (0.16)      -0.013 (0.010)   0.21 (0.15) 
LAE               -0.019 (0.009)   0.60 (0.13)      -0.007 (0.006)   0.21 (0.09) 
TRM               -0.013 (0.006)   0.68 (0.08)      -0.009 (0.005)   0.24 (0.07) 
TRQ($\phi$=0.1)   -0.010 (0.011)   0.65 (0.16)      -0.008 (0.008)   0.19 (0.11) 
TRQ($\phi$=0.2)   -0.012 (0.010)   0.65 (0.14)      -0.008 (0.006)   0.24 (0.09) 

                  TANDY                             TEXACO 
                  $\alpha$         $\beta$          $\alpha$         $\beta$ 
OLS               0.011 (0.010)    1.05 (0.14)      0.001 (0.006)    0.61 (0.09) 
LAE               0.004 (0.013)    0.96 (0.18)      -0.002 (0.006)   0.54 (0.09) 
TRM               0.008 (0.005)    0.94 (0.08)      -0.002 (0.003)   0.58 (0.05) 
TRQ($\phi$=0.1)   0.007 (0.010)    0.99 (0.14)      -0.002 (0.005)   0.55 (0.08) 
TRQ($\phi$=0.2)   0.008 (0.010)    0.95 (0.15)      -0.002 (0.005)   0.57 (0.07) 


Example 8.6 Robust estimation of the CAPM 
The eight CAPM regressions found to have significant non-normality in 
example 8.4 were re-estimated using four robust techniques: LAE, TRM, and 
TRQ with the trimming parameter set at $\phi = 0.1$ and 0.2. These estimates, 
along with the OLS estimates for comparison, are reported in table 8.3. In 
seven of the regressions the robust beta estimators are consistently smaller 
than the OLS, while for the eighth, that of PSNH, the standard errors are 
sufficiently smaller to render the estimates significant. A similar pattern 
occurs for the estimates of $\alpha$: for all except PSNH the robust estimates are 
smaller than the OLS. Moreover, some of the estimates even become 
significantly different from zero. Interestingly, only for PSNH are the OLS 
residuals negatively skewed. These findings are consistent with, for example, 
Mills and Coutts (1996), who also found that robust beta estimates for the 
industry baskets of the London Stock Exchange’s 350 index were smaller 
than their OLS counterparts. 


8.5 The multivariate linear regression model 


An immediate extension of the regression model (8.1) is to replace the 
‘dependent’ variable $y_t$ by a vector, say $y_t = (y_{1t}, \ldots, y_{nt})'$, so that we now 
have the multivariate (dynamic) regression model 

\[
y_t = C + \sum_{i=1}^{m} A_i' y_{t-i} + \sum_{i=0}^{m} B_i' x_{t-i} + u_t, \qquad m + 1 \le t \le T
\tag{8.15}
\]

where $C$ is an $n \times 1$ vector of constants, $A_1, \ldots, A_m$ are $n \times n$ matrices of lag 
coefficients, $B_0, B_1, \ldots, B_m$ are $k \times n$ coefficient matrices and $u_t$ is an $n \times 1$ 
vector of errors having the properties 

\[
E(u_t) = E(u_t \mid Y^0_{t-1}, x^0_t) = 0
\]

and 

\[
E(u_t u_t') = E(u_t u_t' \mid Y^0_{t-1}, x^0_t) = \Omega
\]

where 

\[
Y^0_{t-1} = (y_{t-1}, y_{t-2}, \ldots, y_1)
\]

In matrix form, we have 

\[
Y = X^{*} B + U
\]

where 

\[
\begin{aligned}
Y &= (y_{m+1}, \ldots, y_T)' \\
X^{*} &= (X^{*}_{m+1}, \ldots, X^{*}_T)' \\
X^{*}_t &= (1, y_{t-1}', \ldots, y_{t-m}', x_t', \ldots, x_{t-m}') \\
U &= (u_{m+1}, \ldots, u_T)'
\end{aligned}
\]

and 

\[
B = (C, A_1', \ldots, A_m', B_0', \ldots, B_m')'
\]

The estimation theory for this model is basically a multivariate extension of 
that developed for the univariate case ($n = 1$) above. For example, the LS and 
(approximate) ML estimator of $B$ is 

\[
\hat{B} = (X^{*\prime} X^{*})^{-1} X^{*\prime} Y
\]

while the ML estimator of $\Omega$ is 

\[
\hat{\Omega} = T^{-1} \hat{U}' \hat{U}, \qquad \hat{U} = Y - X^{*} \hat{B}
\]

Spanos (1986, chap. 24) considers this model in some detail, presenting 
misspecification tests that are essentially multivariate extensions of those 
outlined in section 8.3. 
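
The multivariate LS estimator and the associated estimate of $\Omega$ can be computed in a few lines. The sketch below is illustrative only: the function name `multivariate_ls` and the simulated static system (a constant and one regressor, three equations) are our own assumptions, and the divisor used for $\hat{\Omega}$ is the number of rows of `Y`.

```python
import numpy as np

def multivariate_ls(Y, X):
    """LS estimate of B in Y = X B + U, together with U'U divided by the
    number of observations as an estimate of the error covariance matrix."""
    B = np.linalg.solve(X.T @ X, X.T @ Y)
    U = Y - X @ B
    Omega = U.T @ U / Y.shape[0]
    return B, Omega

# Hypothetical system: three equations sharing a constant and one regressor.
rng = np.random.default_rng(0)
T, n = 50, 3
x = rng.standard_normal(T)
X = np.column_stack([np.ones(T), x])
B_true = np.array([[0.0, 0.0, 0.0],
                   [0.8, 1.0, 1.2]])
Y = X @ B_true + 0.1 * rng.standard_normal((T, n))
B_hat, Omega_hat = multivariate_ls(Y, X)
```

Because every equation has the same regressors, each column of `B_hat` coincides with the single-equation LS estimate for the corresponding dependent variable.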


Example 8.7 Multivariate tests of the CAPM 
Since the publication of Gibbons (1982), multivariate tests of the CAPM 
have been the subject of considerable research: for a detailed treatment, see 
Campbell, Lo and MacKinlay (1997, chap. 5). The multivariate CAPM can be 
analysed empirically within the framework of the multivariate regression 
model. By letting $y_t$ be the vector of $n$ excess asset returns at time $t$ and $x_t$ be 
the excess market return at time $t$, the model can be written as 

\[
y_t = C + B x_t + u_t
\]

where $C$ and $B$ are $n \times 1$ vectors of parameters and the error $u_t$ is assumed to 
have the properties of the error in equation (8.15). The CAPM imposes the $n$ 
restrictions that the intercepts in each asset return equation are zero, i.e. 
$C = 0$. MacKinlay (1987; see also Gibbons, Ross and Shanken, 1989) shows 
that this hypothesis can be tested using the statistic 

\[
J = \frac{T - n - 1}{n} \left( 1 + \frac{\bar{x}^2}{s_x^2} \right)^{-1} \hat{C}' \hat{\Omega}^{-1} \hat{C}
\]

where $\bar{x}$ and $s_x^2$ denote the sample mean and variance of $x_t$. 
Under $H_0: C = 0$, $J$ is distributed as $F$ with $n$ and $T - n - 1$ degrees of 
freedom. 
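
A minimal sketch of this test, written in the standard Gibbons-Ross-Shanken form, is given below. The function name `grs_j`, the simulated returns and the use of divisor $T$ for both $\hat{\Omega}$ and the variance of $x_t$ are our own assumptions rather than details taken from the references.

```python
import numpy as np

def grs_j(Y, x):
    """J statistic for H0: C = 0 in y_t = C + B x_t + u_t.

    Y is a T x n array of excess asset returns and x a length-T array of
    excess market returns. Assumption: the error covariance matrix and the
    variance of x are both estimated with divisor T.
    """
    T, n = Y.shape
    X = np.column_stack([np.ones(T), x])
    coef = np.linalg.solve(X.T @ X, X.T @ Y)   # row 0: intercepts, row 1: betas
    C = coef[0]
    U = Y - X @ coef
    Omega = U.T @ U / T
    quad = C @ np.linalg.solve(Omega, C)
    return (T - n - 1) / n * quad / (1.0 + x.mean() ** 2 / x.var())

# Hypothetical returns generated with zero intercepts.
rng = np.random.default_rng(0)
T, n = 120, 3
x = 0.01 + 0.05 * rng.standard_normal(T)
Y = np.outer(x, [0.8, 1.0, 1.2]) + 0.03 * rng.standard_normal((T, n))
J = grs_j(Y, x)
```

The value of `J` would be compared with the critical values of an $F(n, T - n - 1)$ distribution. With a single asset ($n = 1$) the statistic reduces to the squared conventional $t$-ratio on the intercept, which provides a convenient check.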

The seventeen assets considered separately in example 8.4 were re-exam- 
ined in this multivariate framework. Of course, since the same (single) 
regressor appears in each equation, the slope and intercept estimates are the 
same as the single-equation OLS estimates. A test of C= 0 produces a J value 
of 0.71, with an associated marginal significance level of 0.79. Not 
surprisingly, given the intercept estimates reported in table 8.1, we cannot reject the 
null that all the intercepts are zero, in accordance with the predictions of the 
CAPM, although we should emphasise that none of the misspecifications 
uncovered in the individual asset models in example 8.4 have been tackled 
here. 


8.6 Vector autoregressions 


8.6.1 Concepts of exogeneity and causality 


Throughout the various forms of regression models encountered so far in 
this chapter we have made the assumption that $y_t$ is a function of past values 
of itself and present and past values of $x_t$. More precisely, we have been 
assuming that $x_t$ is weakly exogenous: the stochastic structure of $x_t$ contains 
no information that is relevant for the estimation of the parameters of 
interest, $B$ and $\Omega$. Formally, $x_t$ will be weakly exogenous if, when the joint 
distribution of $z_t = (y_t', x_t')'$, conditional on the past, is factorised as the 
conditional distribution of $y_t$ given $x_t$ times the marginal distribution of $x_t$, 
(a) the parameters of these conditional and marginal distributions are not 
subject to cross-restrictions, and (b) the parameters of interest can be 
uniquely determined from the parameters of the conditional model alone. 
Under these conditions $x_t$ may be treated ‘as if’ it were determined outside 
the conditional model for $y_t$. 

For more details on weak exogeneity, see Engle, Hendry and Richard 
(1983), Engle and Hendry (1993) and Hendry (1995). Engle and Hendry 
(1993) extend weak exogeneity to that of superexogeneity: $x_t$ will be 
superexogenous if it is weakly exogenous for $B$ and $\Omega$ and if the parameters 
of the conditional distribution of $y_t$ are invariant to interventions that affect 
the marginal distribution of $x_t$. 

While the weak exogeneity of $x_t$ allows efficient estimation of $B$ and $\Omega$ 
without any reference to the stochastic structure of $x_t$, the marginal 
distribution of $x_t$, while not containing $y_t$, will contain $Y^0_{t-1}$, and the possible 
presence of lagged $y$s can lead to problems when attempting to forecast $y_t$. In 
order to be able to treat $x_t$ as given when forecasting $y_t$, we need to ensure 
that no feedback exists from $Y^0_{t-1}$ to $x_t$; the absence of such feedback is 
equivalent to the statement that $y_t$ does not Granger-cause $x_t$. Weak 
exogeneity supplemented with Granger non-causality is called strong exogeneity. 

Unlike weak exogeneity, Granger non-causality is directly testable (the 
original reference to this concept of causality is Granger, 1969). To investigate such 
tests, and to relate Granger non-causality to yet another concept of exogeneity, 
we need to introduce the dynamic structural equation model (DSEM) and the 
vector autoregressive (VAR) process. The DSEM extends the multivariate 
regression model in two directions: first, by allowing ‘simultaneity’ between the 
‘endogenous’ variables in $y_t$ and, second, by explicitly considering the process 
generating the ‘exogenous’ variables $x_t$. We thus have (in this and the 
subsequent subsection constant terms are omitted for simplicity of notation) 

\[
A_0 y_t = \sum_{i=1}^{m} A_i y_{t-i} + \sum_{i=0}^{m} B_i x_{t-i} + u_{1t}
\tag{8.16}
\]

and 

\[
x_t = \sum_{i=1}^{m} C_i x_{t-i} + u_{2t}
\tag{8.17}
\]

The simultaneity of the model is a consequence of $A_0 \ne I_n$. The errors $u_{1t}$ 
and $u_{2t}$ are assumed to be jointly dependent processes, which could be 
serially correlated but will be assumed here to be white noise, and intercept 
vectors are omitted for simplicity: see Mills (1990, chap. 14) and, in 
particular, Lütkepohl (1991) for a more general development. The identification 
conditions for the set of structural equations (8.16) are summarised in 
Hendry, Pagan and Sargan (1984), while (8.17) shows that $x_t$ is generated by 
an $m$th-order VAR process, in which current values of $x$ are functions of $m$ 
past values of $x$ only. 

If, in the DSEM (8.17), $E(u_{1t} x_{t-s}) = 0$ for all $s$, $x_t$ is said to be strictly 
exogenous. Strict exogeneity is useful because no information is lost by 
limiting attention to distributions conditional on $x_t$, which will usually result 
in considerable simplifications in statistical inference: for example, IV 
techniques may be used in the presence of serially correlated disturbances. A 
related concept is that of a variable being predetermined: a variable is 
predetermined if all its current and past values are independent of the current 
error $u_{1t}$. If $x_t$ is strictly exogenous then it will also be predetermined, while if 
$E(u_{1t} y_{t-s}) = 0$ for $s > 0$ then $y_{t-s}$ will be predetermined as well. 

In many cases, strictly exogenous variables will also be weakly exogenous 
in DSEMs, although one important class of exceptions is provided by 
rational expectations variables, in which behavioural parameters are gener- 
ally linked to the distributions of exogenous variables. Similarly, predeter- 
mined variables will usually be weakly exogenous, except again in the case 
where there are cross-restrictions between behavioural parameters and the 
parameters of the distribution of the predetermined variables. 

Strict exogeneity can be tested in DSEMs by using the final form, in which 
each endogenous variable is expressed as an infinite distributed lag of the 
exogenous variables 

\[
y_t = \sum_{i=0}^{\infty} J_i x_{t-i} + e_t
\]

where the $J_i$ matrices are functions of the $A_i$s and $B_i$s and where $e_t$ is a 
stochastic process possessing a VAR representation and having the property 
that $E(e_t x_{t-s}) = 0$ for all $s$. Geweke (1978) proves that, in the regression of $y_t$ 
on all current, lagged and future values of $x_t$, 

\[
y_t = \sum_{i=-\infty}^{\infty} K_i x_{t-i} + e_t
\tag{8.18}
\]

there will exist a DSEM relating $x_t$ and $y_t$ in which $x_t$ is strictly exogenous if, 
and only if, the coefficients on future values of $x_t$ (i.e. $x_{t-s}$, $s < 0$) are all equal 
to zero. An equivalent test is based on the regression 

\[
x_t = \sum_{i=1}^{\infty} E_{2i} x_{t-i} + \sum_{i=1}^{\infty} F_{2i} y_{t-i} + w_t
\tag{8.19}
\]

in which $E(y_{t-s} w_t') = 0$ for all $t$ and $s > 0$. Geweke proves that $x_t$ will be 
strictly exogenous in a DSEM relating $x_t$ and $y_t$ if, and only if, the coefficient 
matrices $F_{2i}$, $i = 1, 2, \ldots$, are all zero. 
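
In practice the infinite lag polynomials in a regression such as (8.19) have to be truncated. The following sketch, for scalar $x_t$ and $y_t$, truncates both at a finite lag `p` (our assumption) and computes the usual $F$-statistic for the joint exclusion of the lagged $y$s; the function names and the simulated series, in which $y$ feeds into $x$ with a one-period delay, are hypothetical.

```python
import numpy as np

def lagmat(z, p):
    """Columns z_{t-1}, ..., z_{t-p} for t = p, ..., T-1."""
    T = len(z)
    return np.column_stack([z[p - i:T - i] for i in range(1, p + 1)])

def granger_f(x, y, p):
    """F-statistic for H0: lags 1..p of y have zero coefficients in a
    regression of x on a constant, its own p lags and p lags of y."""
    target = x[p:]
    Xr = np.column_stack([np.ones(len(target)), lagmat(x, p)])   # restricted
    Xu = np.column_stack([Xr, lagmat(y, p)])                     # unrestricted

    def rss(X):
        beta = np.linalg.lstsq(X, target, rcond=None)[0]
        return np.sum((target - X @ beta) ** 2)

    rss_r, rss_u = rss(Xr), rss(Xu)
    df = len(target) - Xu.shape[1]
    return ((rss_r - rss_u) / p) / (rss_u / df)

# Hypothetical data: y is white noise and x depends on lagged y.
rng = np.random.default_rng(0)
T = 300
y = rng.standard_normal(T)
e = rng.standard_normal(T)
x = np.zeros(T)
for t in range(1, T):
    x[t] = 0.5 * y[t - 1] + e[t]
f_y_to_x = granger_f(x, y, p=2)   # tests whether lagged y enters the x equation
f_x_to_y = granger_f(y, x, p=2)   # reverse direction
```

A large value of `f_y_to_x` relative to the $F(p, T - 3p - 1)$ critical value would lead to rejection of the hypothesis that the lagged $y$ coefficients are zero.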

Strict exogeneity is intimately related to Granger non-causality. Indeed, 
the two tests for strict exogeneity of $x_t$ above can also be regarded as tests for 
$y_t$ not Granger-causing $x_t$. The two concepts are not equivalent, however. As 
Geweke (1984) points out, if $x_t$ is strictly exogenous in the DSEM (8.16) then 
$y_t$ does not Granger-cause $x_t$, where $y_t$ is endogenous in that model. If $y_t$ does 
not Granger-cause $x_t$, however, then there exists a DSEM with $y_t$ endogenous 
and $x_t$ strictly exogenous, in the sense that there will exist systems of 
equations formally similar to (8.16), but none of these systems necessarily satisfy 
the overidentifying restrictions of the specific model. This implies that tests 
for the absence of a causal ordering can be used to refute the strict exogeneity 
specification in a given DSEM, but such tests cannot be used to establish it. 

Furthermore, as we have already discussed, statistical inference may be 
carried out conditionally on a subset of variables that are not strictly 
exogenous; all that we require is that they be weakly exogenous. Thus, 
unidirectional Granger causality is neither necessary nor sufficient for 
inference to proceed conditional on a subset of variables. 


8.6.2 Tests of Granger causality 


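To develop operational tests of Granger causality, we now consider the
$g = n + k + r$ dimensional vector $z_t = (y_t^{\top}, x_t^{\top}, r_t^{\top})^{\top}$, which we
assume has the following $m$th-order VAR representation (see, for example,
Sims, 1980):


$$z_t = \sum_{i=1}^{m} \Pi_i z_{t-i} + v_t \qquad (8.20)$$


where

$$E(v_t) = E(v_t \mid Z_{t-1}^{0}) = 0$$

$$E(v_t v_s^{\top} \mid Z_{t-1}^{0}) = \begin{cases} \Sigma_v, & t = s \\ 0, & t \neq s \end{cases}$$

and

$$Z_{t-1}^{0} = (z_{t-1}, z_{t-2}, \ldots, z_1)$$


The VAR of equation (8.20) can be partitioned as (the $r$ equations
modelling $r_t$ may be ignored here)


$$y_t = \sum_{i=1}^{m} C_{2i} x_{t-i} + \sum_{i=1}^{m} D_{2i} y_{t-i} + \sum_{i=1}^{m} G_{1i} r_{t-i} + v_{1t} \qquad (8.21)$$


$$x_t = \sum_{i=1}^{m} E_{2i} x_{t-i} + \sum_{i=1}^{m} F_{2i} y_{t-i} + \sum_{i=1}^{m} G_{2i} r_{t-i} + v_{2t} \qquad (8.22)$$


where $v_t = (v_{1t}^{\top}, v_{2t}^{\top})^{\top}$ and where $\Sigma_v$ is correspondingly partitioned as


$$\Sigma_v = \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{12}^{\top} & \Sigma_{22} \end{bmatrix}$$


Here $\Sigma_{ij} = E(v_{it} v_{jt}^{\top})$, $i, j = 1, 2$, so that, although the vectors $v_{1t}$ and
$v_{2t}$ are both serially uncorrelated, they can be correlated with each other
contemporaneously, although at no other lag. Given equations (8.21) and (8.22), $x$
does not Granger-cause $y$ if, and only if, $C_{2i} = 0$, for all $i$. An equivalent
statement of this proposition is that $|\Sigma_{11}| = |\Sigma_1|$, where
$\Sigma_1 = E(w_{1t} w_{1t}^{\top})$, obtained from the ‘restricted’ regression


$$y_t = \sum_{i=1}^{m} D_{1i} y_{t-i} + \sum_{i=1}^{m} G_{3i} r_{t-i} + w_{1t} \qquad (8.23)$$


Similarly, $y$ does not Granger-cause $x$ if, and only if, $F_{2i} = 0$, for all $i$ or,
equivalently, that $|\Sigma_{22}| = |\Sigma_2|$, where $\Sigma_2 = E(w_{2t} w_{2t}^{\top})$, obtained
from the regression


$$x_t = \sum_{i=1}^{m} E_{1i} x_{t-i} + \sum_{i=1}^{m} G_{4i} r_{t-i} + w_{2t} \qquad (8.24)$$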
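If the system (8.21)–(8.22) is pre-multiplied by the matrix


$$\begin{bmatrix} I_n & -\Sigma_{12}\Sigma_{22}^{-1} \\ -\Sigma_{12}^{\top}\Sigma_{11}^{-1} & I_k \end{bmatrix}$$


then the first $n$ equations of the new system can be written as


$$y_t = \sum_{i=0}^{m} C_{3i} x_{t-i} + \sum_{i=1}^{m} D_{3i} y_{t-i} + \sum_{i=1}^{m} G_{5i} r_{t-i} + \omega_{1t} \qquad (8.25)$$


where the error $\omega_{1t} = v_{1t} - \Sigma_{12}\Sigma_{22}^{-1} v_{2t}$, since it is
uncorrelated with $v_{2t}$, is also uncorrelated with $x_t$. Similarly, the last $k$
equations can be written as


$$x_t = \sum_{i=1}^{m} E_{3i} x_{t-i} + \sum_{i=0}^{m} F_{3i} y_{t-i} + \sum_{i=1}^{m} G_{6i} r_{t-i} + \omega_{2t} \qquad (8.26)$$


Denoting $\Sigma_{\omega i} = E(\omega_{it}\omega_{it}^{\top})$, $i = 1, 2$, there is instantaneous
causality between $y$ and $x$ if, and only if, $C_{30} \neq 0$ and $F_{30} \neq 0$, or,
equivalently, if $|\Sigma_{11}| > |\Sigma_{\omega 1}|$ and $|\Sigma_{22}| > |\Sigma_{\omega 2}|$.

Tests of Granger causality can be constructed once estimates of the various
covariance matrices have been obtained. Consistent and efficient estimates of
the parameters of the regressions (8.21) to (8.26) are given by LS, so that the
following matrices can be formed:


$$\hat{\Sigma}_i = (T - m)^{-1} \sum_{t=m+1}^{T} \hat{w}_{it}\hat{w}_{it}^{\top}$$

$$\hat{\Sigma}_{ii} = (T - m)^{-1} \sum_{t=m+1}^{T} \hat{v}_{it}\hat{v}_{it}^{\top}$$

$$\hat{\Sigma}_{\omega i} = (T - m)^{-1} \sum_{t=m+1}^{T} \hat{\omega}_{it}\hat{\omega}_{it}^{\top}$$


for $i = 1, 2$, where $\hat{w}_{it}$ is the vector of LS residuals corresponding to the
error vector $w_{it}$, etc. The LR test statistic of the null hypothesis
$H_{01}: C_{2i} = 0$ for all $i$ ($x$ does not Granger-cause $y$) is


$$LR_1 = (T - m)\log\left(|\hat{\Sigma}_1| / |\hat{\Sigma}_{11}|\right) \sim \chi^2_{nkm}$$


Similarly, the null that $y$ does not Granger-cause $x$, $H_{02}: F_{2i} = 0$, is tested by


$$LR_2 = (T - m)\log\left(|\hat{\Sigma}_2| / |\hat{\Sigma}_{22}|\right) \sim \chi^2_{nkm}$$


while the null that there is no instantaneous causality between $y$ and $x$,
$H_{03}: C_{30} = F_{30} = 0$, is tested by


$$LR_3 = (T - m)\log\left(|\hat{\Sigma}_{11}| / |\hat{\Sigma}_{\omega 1}|\right) = (T - m)\log\left(|\hat{\Sigma}_{22}| / |\hat{\Sigma}_{\omega 2}|\right) \sim \chi^2_{nk}$$


Since these are tests of nested hypotheses, they are asymptotically
independent. All three restrictions can be tested at once since


$$LR_1 + LR_2 + LR_3 \sim \chi^2_{nk(2m+1)}$$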
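The statistic $LR_1$ can be computed directly from two LS regressions. The sketch below is only illustrative: it assumes zero-mean data, no conditioning block $r_t$, and simulated series; the helper names (`lagged_design`, `resid_cov`, `granger_lr`) are hypothetical and not taken from any package.

```python
import numpy as np

def lagged_design(series_list, m):
    """Stack lags 1..m of each (T x d) array into one regressor matrix with T - m rows."""
    T = series_list[0].shape[0]
    return np.hstack([s[m - i:T - i] for s in series_list for i in range(1, m + 1)])

def resid_cov(Y, X):
    """LS residual covariance matrix, using the divisor T - m (the number of rows)."""
    B, *_ = np.linalg.lstsq(X, Y, rcond=None)
    E = Y - X @ B
    return E.T @ E / Y.shape[0]

def granger_lr(y, x, m):
    """LR statistic and degrees of freedom for H0: x does not Granger-cause y."""
    T = y.shape[0]
    Y = y[m:]
    sigma_1 = resid_cov(Y, lagged_design([y], m))       # restricted: own lags only
    sigma_11 = resid_cov(Y, lagged_design([y, x], m))   # unrestricted: lags of y and x
    lr = (T - m) * (np.log(np.linalg.det(sigma_1)) - np.log(np.linalg.det(sigma_11)))
    return lr, y.shape[1] * x.shape[1] * m              # n * k * m restrictions

# Simulated example (assumption): x feeds into y with one lag, but not the reverse.
rng = np.random.default_rng(0)
T = 500
x = np.zeros((T, 1))
y = np.zeros((T, 1))
for t in range(1, T):
    x[t] = 0.5 * x[t - 1] + rng.standard_normal()
    y[t] = 0.3 * y[t - 1] + 0.8 * x[t - 1] + rng.standard_normal()

lr_xy, df = granger_lr(y, x, m=2)   # tests x -> y
lr_yx, _ = granger_lr(x, y, m=2)    # tests y -> x
```

Each statistic would then be compared with a $\chi^2$ critical value on `df` degrees of freedom; in this design `lr_xy` should be large while `lr_yx` behaves like a draw from the null distribution.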


Wald and LM statistics may be constructed in analogous fashion. 
Although various other tests of causality have been proposed, they tend to 
require considerably more computation, and, in any event, simulation 
studies carried out by a variety of authors reach a consensus that inference 
should be carried out using the procedures detailed above, these being found 
to combine the greatest reliability with computational ease.


8.6.3 Determining the order of a VAR


These tests of causality assume that the order $m$ of the underlying VAR is
known. In practice, of course, $m$ will be unknown and must be determined
empirically. A traditional tool for determining the order is to use a sequential
testing procedure. If we have the $g$-dimensional VAR given by (8.20), from
which the ML estimate of $\Sigma_v$ is


$$\hat{\Sigma}_{v,m} = T^{-1}\hat{V}_m\hat{V}_m^{\top}$$


where $\hat{V}_m = (\hat{v}_{m+1}, \ldots, \hat{v}_T)$ is the matrix of residuals obtained by LS
estimation of the $m$th-order VAR, VAR($m$), then, for example, the LR statistic
for testing $m$ against $l$, $l < m$, is


$$LR(m, l) = (T - gm)\log\left(|\hat{\Sigma}_{v,l}| / |\hat{\Sigma}_{v,m}|\right) \sim \chi^2_{g^2(m-l)}$$


This uses the scaling factor $T - gm$ rather than $T$ to account for possible
small-sample bias in the statistic.

Other procedures are based upon minimising some objective function and 
are essentially multivariate analogues of those discussed in example 2.3. The 
objective function that is most favoured is the multivariate BIC criterion,
defined here as


$$BIC(m) = \log|\hat{\Sigma}_{v,m}| + g^2 m T^{-1}\log T, \qquad m = 0, 1, \ldots, m^*$$


where m* is the maximum order considered. This can be shown to provide a 
consistent estimate of the correct lag order, and Lütkepohl (1985) finds that it 
also chooses the correct order most often, and the resulting VAR models 
provide the best forecasts, in a Monte Carlo comparison of objective functions. 
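As a rough illustration of both order-selection devices, the sketch below computes BIC($m$) and the LR statistic for a simulated bivariate VAR(1). It rests on some assumptions not made in the text: zero-mean data, a common estimation sample starting at observation $m^*$ for every order, and hypothetical helper names.

```python
import numpy as np

def var_resid_cov(z, m, start):
    """Residual covariance of an LS-fitted VAR(m) using observations start..T-1."""
    T, g = z.shape
    Y = z[start:]
    if m == 0:
        E = Y                                   # no regressors: residuals are the data
    else:
        X = np.hstack([z[start - i:T - i] for i in range(1, m + 1)])
        B, *_ = np.linalg.lstsq(X, Y, rcond=None)
        E = Y - X @ B
    return E.T @ E / Y.shape[0]

def bic_table(z, m_max):
    """BIC(m) = log|Sigma_hat| + g^2 * m * log(T) / T for m = 0..m_max."""
    T, g = z.shape
    return {m: np.log(np.linalg.det(var_resid_cov(z, m, m_max)))
               + g ** 2 * m * np.log(T) / T
            for m in range(m_max + 1)}

def lr_order(z, m, l, m_max):
    """LR(m, l) with scaling T - g*m and g^2 * (m - l) degrees of freedom."""
    T, g = z.shape
    s_l = var_resid_cov(z, l, m_max)
    s_m = var_resid_cov(z, m, m_max)
    stat = (T - g * m) * (np.log(np.linalg.det(s_l)) - np.log(np.linalg.det(s_m)))
    return stat, g ** 2 * (m - l)

# Simulated VAR(1) (assumption): the true order is 1.
rng = np.random.default_rng(42)
A = np.array([[0.5, 0.2], [0.0, 0.4]])
z = np.zeros((400, 2))
for t in range(1, 400):
    z[t] = A @ z[t - 1] + rng.standard_normal(2)

bic = bic_table(z, m_max=4)
m_hat = min(bic, key=bic.get)                  # order minimising BIC
lr_stat, lr_df = lr_order(z, m=1, l=0, m_max=4)
```

Holding the estimation sample fixed across orders is one common convention that keeps the determinants comparable; other choices are possible and would change the numbers slightly.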

After a tentative model has been specified using one of these procedures, 
checks on its adequacy may be carried out. These are analogous to the diag- 
nostic checks used for univariate models, and might involve overfitting and 
testing the significance of the extra parameters, plotting standardised residuals 
against time and analysing the estimated cross-correlation matrices of the 
residual series. Multivariate portmanteau and LM statistics are also available, 
but with vector time series there is probably no substitute for detailed 
inspection of the residual correlation structure for revealing subtle
relationships that may indicate important directions for model improvement.


8.7 Variance decompositions, innovation accounting and structural VARs


A concise representation of the VAR($m$) model is obtained by using lag
operator notation


$$\Pi(B) z_t = v_t$$


where


$$\Pi(B) = I - \Pi_1 B - \Pi_2 B^2 - \cdots - \Pi_m B^m$$


Analogous to the univariate case, the vector MA representation of $z_t$ is

$$z_t = \Pi^{-1}(B) v_t = \Psi(B) v_t = v_t + \sum_{i=1}^{\infty} \Psi_i v_{t-i} \qquad (8.27)$$

where

$$\Psi_i = \sum_{j=1}^{i} \Pi_j \Psi_{i-j}, \qquad \Psi_0 = I_g$$

with $\Pi_j = 0$ for $j > m$.


In this set-up, no distinction is made between endogenous and (strictly)
exogenous variables, so the $\Psi_i$ matrices can be interpreted as the dynamic
multipliers of the system, since they represent the model’s response to a unit
shock in each of the variables. The response of $z_i$ to a unit shock in $z_j$ (i.e. to
$v_{jt}$ taking the value unity, where $v_{jt}$ is the $j$th element of $v_t$) is therefore
given by the sequence, known as the impulse response function,


$$\Psi_{ij,1}, \Psi_{ij,2}, \Psi_{ij,3}, \ldots$$


where $\Psi_{ij,k}$ is the $ij$th element of the matrix $\Psi_k$. If a variable or block of
variables are strictly exogenous, then the implied zero restrictions ensure that
these variables do not react to a shock to any of the endogenous variables.


Recall, however, that $E(v_t v_t^{\top}) = \Sigma_v$, so that the components of $v_t$ are
contemporaneously correlated. If these correlations are high, simulation of a
shock to $z_j$, while all other components of $z_t$ are held constant, could be
misleading, as there is no way of separating out the response of $z_i$ to $z_j$ from
its response to other shocks that are correlated with $v_{jt}$.

If we define the lower triangular matrix $S$ such that $SS^{\top} = \Sigma_v$ and
$\eta_t = S^{-1} v_t$, however, then $E(\eta_t \eta_t^{\top}) = I_g$, so that the transformed
shocks $\eta_t$ are orthogonal to each other. We can then renormalise the MA
representation (8.27) into the recursive form


$$z_t = \sum_{i=0}^{\infty} (\Psi_i S)(S^{-1} v_{t-i}) = \sum_{i=0}^{\infty} \Psi_i^{O} \eta_{t-i}$$


where $\Psi_i^{O} = \Psi_i S$ (so that $\Psi_0^{O} = \Psi_0 S$ is lower triangular). The impulse
response function of $z_i$ to a unit shock in $z_j$ is then given by the sequence


$$\Psi^{O}_{ij,0}, \Psi^{O}_{ij,1}, \Psi^{O}_{ij,2}, \ldots$$


where each impulse response can be written compactly as

$$\Psi^{O}_{ij,h} = e_i^{\top} \Psi_h S e_j \qquad (8.28)$$


where $e_i$ is the $g \times 1$ selection vector containing unity as the $i$th element and
zeros elsewhere. This sequence is known as the orthogonalised impulse
response function.
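The recursion for the MA matrices and the Cholesky-based orthogonalisation can be sketched in a few lines. The coefficient and covariance matrices below are made-up illustrative values, and the function names are hypothetical.

```python
import numpy as np

def ma_matrices(Pi, H):
    """Psi_0..Psi_H from the VAR coefficient matrices Pi = [Pi_1, ..., Pi_m]."""
    g = Pi[0].shape[0]
    Psi = [np.eye(g)]
    for i in range(1, H + 1):
        # Psi_i = sum_{j=1}^{min(i, m)} Pi_j Psi_{i-j}
        Psi.append(sum(Pi[j - 1] @ Psi[i - j] for j in range(1, min(i, len(Pi)) + 1)))
    return Psi

def orth_irf(Pi, Sigma, H):
    """Array with element [h, i, j] = e_i' Psi_h S e_j, where S S' = Sigma."""
    S = np.linalg.cholesky(Sigma)          # lower triangular factor
    return np.array([P @ S for P in ma_matrices(Pi, H)])

# Illustrative bivariate VAR(1) (assumed values).
Pi1 = np.array([[0.5, 0.1], [0.2, 0.3]])
Sigma = np.array([[1.0, 0.6], [0.6, 1.0]])
irf = orth_irf([Pi1], Sigma, H=5)
```

Here `irf[0]` is the lower triangular impact matrix $\Psi_0^{O} = S$, so the first variable does not respond on impact to the second orthogonalised shock.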

The uncorrelatedness of the $\eta_t$s allows the error variance of the $H$-step
ahead forecast of $z_i$ to be decomposed into components accounted for by
these shocks, or innovations: hence the phrase coined by Sims (1981) for this
technique, that of innovation accounting. In particular, the proportion of the
$H$-step ahead forecast error variance of variable $i$ accounted for by the
orthogonalised innovations to $z_j$ is given by


$$V^{O}_{ij,H} = \frac{\sum_{h=0}^{H} (\Psi^{O}_{ij,h})^2}{\sum_{h=0}^{H} e_i^{\top}\Psi_h\Sigma_v\Psi_h^{\top}e_i} = \frac{\sum_{h=0}^{H} (e_i^{\top}\Psi_h S e_j)^2}{\sum_{h=0}^{H} e_i^{\top}\Psi_h\Sigma_v\Psi_h^{\top}e_i}$$


For large $H$, this orthogonalised forecast error variance decomposition
allows the isolation of those relative contributions to variability that are,
intuitively, ‘persistent’ (for further details of this technique, see, for
example, Doan, Litterman and Sims, 1984). The technique does, however,
have an important disadvantage: the choice of the $S$ matrix is not unique,
so that different choices (for example, different orderings of the variables)
will alter the $\Psi^{O}_{ij,h}$ coefficients, and hence the impulse response functions
and variance decompositions. The extent of these changes will depend upon
the size of the contemporaneous correlations between the components of
the $v_t$ vector.
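The dependence on the ordering is easy to see numerically. The following sketch, for a VAR(1) with assumed parameter values, computes the decomposition for the original ordering and for the reversed ordering; the function name `fevd` and the way the reversed factor is built are illustrative choices.

```python
import numpy as np

def fevd(Pi1, Sigma, H, S):
    """Shares V[i, j] of the forecast error variance of variable i due to shock j,
    summing horizons h = 0..H, for a VAR(1) and an impact matrix S with S S' = Sigma."""
    g = Pi1.shape[0]
    num = np.zeros((g, g))
    den = np.zeros(g)
    Psi = np.eye(g)
    for _ in range(H + 1):
        num += (Psi @ S) ** 2                   # (e_i' Psi_h S e_j)^2
        den += np.diag(Psi @ Sigma @ Psi.T)     # e_i' Psi_h Sigma Psi_h' e_i
        Psi = Pi1 @ Psi
    return num / den[:, None]

Pi1 = np.array([[0.5, 0.1], [0.2, 0.3]])        # assumed values
Sigma = np.array([[1.0, 0.8], [0.8, 1.0]])      # strongly correlated innovations

S_I = np.linalg.cholesky(Sigma)                 # ordering I: variable 1 first
P = np.eye(2)[::-1]                             # permutation that reverses the ordering
S_II = P @ np.linalg.cholesky(P @ Sigma @ P.T) @ P.T   # ordering II, mapped back

V_I = fevd(Pi1, Sigma, 6, S_I)
V_II = fevd(Pi1, Sigma, 6, S_II)
impact_I = fevd(Pi1, Sigma, 0, S_I)
impact_II = fevd(Pi1, Sigma, 0, S_II)
```

Each row of `V_I` and `V_II` sums to one, but the split between own and other innovations differs markedly between the two orderings. On impact the variable placed first is explained entirely by its own innovation, which is the pattern behind the 100 per cent entries that recursive decompositions typically show at the shortest horizon.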

This non-invariance property has generated much detailed analysis and 
criticism of the variance decomposition methodology, focusing on the 
inability of VARs to be regarded as ‘structural’ in the traditional econometric
sense, so that shocks cannot be uniquely identified with a particular variable
unless prior identifying assumptions are made, without which the computed 
impulse response functions and variance decompositions would be invalid. 

To make this point more concrete, suppose we have a (first-order) DSEM 
but, in keeping with the general philosophy of VARs, no variables are con- 
sidered to be exogenous, at least a priori — i.e. 


$$A_0 y_t = A_1 y_{t-1} + B u_t \qquad (8.29)$$

We also assume that the structural errors $u_t$ have zero cross-correlation:
hence,

$$E(u_t u_s^{\top}) = \begin{cases} \Sigma_u, & t = s \\ 0, & t \neq s \end{cases}$$


and $\Sigma_u$ is diagonal. The diagonal elements of $A_0$ and $B$ are normalised to
unity, thus associating each structural equation with a natural left-hand-side
variable and with a particular structural error. Contemporaneous
interactions are captured by non-zero off-diagonal elements in these matrices, $A_0$
capturing interactions between the variables and $B$ modelling the direct
effects of disturbances on variables other than those appearing on the
left-hand side of the structural equations.

Pre-multiplying (8.29) by $A_0^{-1}$ obtains the VAR (8.20) with $m = 1$,
$\Pi_1 = A_0^{-1}A_1$ and $A_0 v_t = B u_t$. The VAR is thus seen to be the reduced form of
the DSEM, and the VAR error $v_t$ is a linear combination of the errors of the
DSEM. It is this fact that makes the interpretation of impulse response
functions and variance decompositions potentially ambiguous. Recall that
$\Psi_{ij,k}$ measures the response of $z_i$ to a unit shock in $z_j$ after $k$ periods. But a
shock to $z_j$, given by the $j$th element of $v_t$, is now seen to be made up of all
the structural innovations $u_t$ and hence, in the absence of further
information, could have been the consequence of a shock to any of the variables
in the DSEM. The recursive triangularisation introduced above implies that
$v_t = S\eta_t$, so the recursive innovations and the structural innovations will
coincide only if $S = A_0^{-1}B$, which will be satisfied if the DSEM itself has
the same lower triangular structure, i.e. if $B$ is diagonal and $A_0$ lower
triangular.
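A small numerical check may make the identification point concrete. In the sketch below (all matrices are assumed, illustrative values) a recursive structure reproduces the Cholesky factor exactly, up to the scaling of each structural error by its standard deviation, whereas a non-recursive $A_0$ does not.

```python
import numpy as np

Sigma_u = np.diag([0.04, 0.01])                 # diagonal structural covariance
B = np.eye(2)                                   # diagonal B (here the identity)

# Recursive case: A0 lower triangular with unit diagonal.
A0 = np.array([[1.0, 0.0], [-0.5, 1.0]])
M = np.linalg.inv(A0) @ B                       # reduced-form errors: v_t = M u_t
Sigma_v = M @ Sigma_u @ M.T
S = np.linalg.cholesky(Sigma_v)
recursive_matches = np.allclose(S, M @ np.sqrt(Sigma_u))

# Non-recursive case: a contemporaneous effect in both directions.
A0_alt = np.array([[1.0, 0.3], [-0.5, 1.0]])
M_alt = np.linalg.inv(A0_alt) @ B
S_alt = np.linalg.cholesky(M_alt @ Sigma_u @ M_alt.T)
alt_matches = np.allclose(S_alt, M_alt @ np.sqrt(Sigma_u))
```

In the first case `recursive_matches` is true, so the orthogonalised shocks are just the structural shocks rescaled to unit variance. In the second case the Cholesky factor still exists, but it no longer corresponds to the structural impact matrix, so responses computed from it mix the two structural shocks.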

Numerous authors have argued that these assumptions have no particular 
economic rationale — that they are atheoretical, using the term of Cooley and 
LeRoy (1985). This has led to the development of other sets of identifying 
restrictions that are based more explicitly on economic considerations.
Bernanke (1986) and Blanchard (1989), for example, impose alternative sets
of restrictions on $A_0$ and $B$ that in effect constrain the short-run impact of
shocks to $z$, while Blanchard and Quah (1989) exploit a different set of
restrictions that constrain the long-run effects of shocks to $z$ and thus impose
restrictions across $A_0$, $A_1$ and $B$. Swanson and Granger (1997) present a
method that combines both prior economic knowledge and statistical 
analysis of the VAR residuals. 

An alternative approach has been proposed by Pesaran and Shin (1997), 
extending the work of Koop, Pesaran and Potter (1996). This proposes using 
generalised impulse responses as a means of circumventing the dependence of
the orthogonalised responses on the ordering of the variables. The generalised
impulse response is defined by replacing $S$ in (8.28) with $\sigma_{jj}^{-1/2}\Sigma_v$,
where $\sigma_{jj}$ is the $j$th diagonal element of $\Sigma_v$:


$$\Psi^{G}_{ij,h} = \sigma_{jj}^{-1/2} e_i^{\top}\Psi_h\Sigma_v e_j$$


thus leading to the generalised forecast error variance decomposition


$$V^{G}_{ij,H} = \frac{\sum_{h=0}^{H} (\Psi^{G}_{ij,h})^2}{\sum_{h=0}^{H} e_i^{\top}\Psi_h\Sigma_v\Psi_h^{\top}e_i} = \frac{\sigma_{jj}^{-1}\sum_{h=0}^{H} (e_i^{\top}\Psi_h\Sigma_v e_j)^2}{\sum_{h=0}^{H} e_i^{\top}\Psi_h\Sigma_v\Psi_h^{\top}e_i}$$


The generalised impulse responses are invariant to the ordering of the
variables, are unique and fully take into account the historical patterns of
correlations observed amongst the different shocks. The orthogonalised and
generalised impulse responses will coincide only if $\Sigma_v$ is diagonal, and in
general are only the same for $j = 1$ (Pesaran and Shin, 1997).
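Both properties, invariance to the ordering and agreement with the orthogonalised responses for the first variable, can be verified on a small example. The sketch assumes a VAR(1) with made-up parameters and scales each response by the standard deviation of the innovation being shocked; `girf_var1` is a hypothetical helper name.

```python
import numpy as np

def girf_var1(Pi1, Sigma, H):
    """G[h, i, j]: response of variable i at horizon h to a one-standard-deviation
    shock in the j-th innovation of a VAR(1)."""
    g = Pi1.shape[0]
    scale = 1.0 / np.sqrt(np.diag(Sigma))       # sigma_jj^(-1/2) for each shock j
    out, Psi = [], np.eye(g)
    for _ in range(H + 1):
        out.append((Psi @ Sigma) * scale[None, :])   # scale column j
        Psi = Pi1 @ Psi
    return np.array(out)

Pi1 = np.array([[0.5, 0.1], [0.2, 0.3]])        # assumed values
Sigma = np.array([[1.0, 0.6], [0.6, 2.0]])
G = girf_var1(Pi1, Sigma, 4)

# Orthogonalised responses to the first shock, for comparison.
S = np.linalg.cholesky(Sigma)
first_col_orth = np.array([np.linalg.matrix_power(Pi1, h) @ S[:, 0] for h in range(5)])

# Reverse the ordering of the variables and recompute.
P = np.eye(2)[::-1]
G_rev = girf_var1(P @ Pi1 @ P.T, P @ Sigma @ P.T, 4)
```

Reordering the variables merely permutes the rows and columns of each `G[h]`, whereas the Cholesky-based responses would change in substance.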

Methods of computing standard errors of the impulse response functions 
in the above situations are discussed in detail in Hamilton (1994, chap. 11.7). 


8.8 Vector ARMA models 


A natural extension of the VAR is the vector ARMA process


$$z_t = \sum_{i=1}^{p} \Phi_i z_{t-i} + v_t - \sum_{i=1}^{q} \Theta_i v_{t-i}$$


or


$$\Phi(B) z_t = \Theta(B) v_t$$

where

$$\Phi(B) = I - \Phi_1 B - \cdots - \Phi_p B^p$$

and

$$\Theta(B) = I - \Theta_1 B - \cdots - \Theta_q B^q$$


which, of course, admits a VAR($\infty$) representation with $\Pi(B) = \Theta^{-1}(B)\Phi(B)$.
Unfortunately, the presence of a vector MA component complicates analysis
somewhat, and vector ARMA models are rarely used nowadays. Details of
such models, including estimation methods and model-building techniques,
may be found in, for example, Mills (1990, chap. 14) and Lütkepohl (1991).


Example 8.8 The interaction of equity and bond markets in the United Kingdom
The example that is used to illustrate VAR modelling brings together four 
series that have been used in previous examples. These are the FTA All Share 
index and associated dividend index, first introduced in example 2.6, and the 
series on twenty-year UK gilts and ninety-one-day Treasury bills, used to 
construct the spread in example 2.2. Previous examples have shown that the 
logarithms are all I(1), so that the first-differences, $\Delta p$, $\Delta d$, $\Delta rs$ and
$\Delta r20$, are individually stationary and hence suitable for modelling in a VAR
framework. The first three series were analysed within a single-equation
framework modelling $\Delta p$ in example 8.5. The sample period is January
1965 to December 2005, so that the sample size is $T = 492$. Thus, $g = 4$ and
$z_t = (\Delta p_t, \Delta d_t, \Delta rs_t, \Delta r20_t)$. Table 8.4 presents BIC values for lags
$m = 0, 1, \ldots, 12$, along with LR statistics for testing $m$ against $m - 1$,
beginning at $m = 12$. The minimum BIC is found at $m = 1$, whereas the first
significant LR statistic, using the 5 per cent level for each test, is at $m = 2$.
Setting m= 2 revealed no residual autocorrelation, and this order of VAR 
was therefore selected. Summary statistics for the VAR(2) are shown in 
table 8.5, along with the estimated contemporaneous residual correlation 
matrix.


Table 8.4 BIC values and LR statistics for determining the order of the VAR in example 8.8

m      BIC(m)      LR(m, m−1)
0      −12.74      –
1      −13.99†     688.25
2      −13.87      42.46*
3      −13.69      13.68
4      −13.53      19.99
5      −13.36      18.87
6      −13.19      17.10
7      −13.00      7.56
8      −12.83      15.40
9      −12.65      12.10
10     −12.49      18.81
11     −12.32      15.11
12     −12.15      14.67

Notes:
$LR(m, m-1) \sim \chi^2_{16}$; $\chi^2_{16,0.05} = 26.30$.
† = minimum BIC.
* = first significant LR statistic.


Table 8.5 Summary statistics for the VAR(2) of example 8.8

          R²       s.e.      Q(12)
Δp        0.04     0.058     9.6
Δd        0.38     0.060     14.6
Δrs       0.13     0.054     16.5
Δr20      0.06     0.034     6.3

Contemporaneous residual correlation matrix

          Δp       Δd        Δrs      Δr20
Δp        1
Δd        0.88     1
Δrs       −0.14    −0.08     1
Δr20      −0.26    −0.18     0.41     1


To illustrate Granger causality testing, we first consider the set of
pairwise test statistics: for example, with $y_t = \Delta p_t$, $x_t = \Delta r20_t$ and
$r_t = (\Delta d_t, \Delta rs_t)$, Granger causality between equity prices and long interest
rates may be examined. These are shown in table 8.6(a), and indicate strong
causality running from $\Delta r20$ to the other three variables and modest
evidence of feedback between $\Delta p$ and $\Delta d$. The presence of a large positive
contemporaneous correlation between the residuals of the $\Delta r20$ and $\Delta rs$
and $\Delta p$ and $\Delta d$ equations shows that there is also instantaneous causality
in the bond and equity markets. A reasonably large but negative
contemporaneous correlation between $\Delta p$ and $\Delta r20$ indicates that a rise in equity
prices is accompanied by a fall in long interest rates, presumably as
investors switch out of gilts into higher-yielding equities. Thus, it would
seem that the primary source of dynamic response is from long interest
rates to stock prices and to short interest rates, i.e. a positive shock to the
gilts market leads to a positive shock in the short-term money market and a
negative shock to the equity market.


Table 8.6 Granger causality tests

(a) Pairwise causality statistics

y → x             LR(y → x) ~ χ²(2)
Δd → Δp           5.36  [0.07]
Δrs → Δp          0.32  [0.85]
Δr20 → Δp         7.60  [0.02]
Δp → Δd           90.60 [0.00]
Δrs → Δd          0.59  [0.74]
Δr20 → Δd         6.08  [0.05]
Δp → Δrs          0.31  [0.86]
Δd → Δrs          0.04  [0.98]
Δr20 → Δrs        10.34 [0.00]
Δp → Δr20         3.85  [0.15]
Δd → Δr20         0.03  [0.98]
Δrs → Δr20        0.10  [0.95]

(b) $y_t = (\Delta p_t, \Delta d_t, \Delta rs_t)$, $x_t = (\Delta r20_t)$

                  LR(y → x) ~ χ²(6)
y → x             21.95 [0.00]
x → y             10.28 [0.11]

By defining $y_t = (\Delta p_t, \Delta d_t, \Delta rs_t)$, $x_t = (\Delta r20_t)$ and $r_t$ null, so that $n = 3$,
$k = 1$ and $r = 0$, we can test whether long interest rates jointly Granger-cause
the other three variables — i.e. whether the gilts market jointly Granger-causes 
the bond and equity markets, and whether there is any evidence of any joint 
feedback from them to the gilts market. From the set of statistics shown in 
table 8.6(b), the gilts market is confirmed to Granger-cause the bond and 
equity markets, but there is no evidence of feedback. 

Two variance decompositions are reported in table 8.7. The first (denoted I)
uses the ordering defining $z_t$, while the second (II) reverses this, i.e. $\Delta r20$,
$\Delta rs$, $\Delta d$, $\Delta p$. Own innovations have the major weight in the decompositions,
but many of the other weights are reasonably large, although all the
decompositions are, basically, settled after six months. Due to the large
contemporaneous correlations, major shifts in weights are found when the ordering is
changed. In the absence of any structural model suggesting a theoretical
ordering, there is no way of establishing which of the variance decompositions
is appropriate.


Table 8.7 Variance decompositions

                              Explained by
               Δp              Δd              Δrs             Δr20
        h      I      II       I      II       I      II       I      II
Δp      1      100    22.3     0      71.0     0      0.2      0      6.6
        3      98.1   21.4     0.3    69.9     0.1    0.2      1.5    8.5
        6      98.1   21.4     0.3    69.9     0.1    0.2      1.5    8.5
Δd      1      76.6   0        23.4   96.7     0      0        0      3.3
        3      82.9   4.8      15.7   91.1     0      0.3      1.4    3.8
        6      83.0   4.8      15.5   91.1     0      0.3      1.5    3.8
Δrs     1      2.0    0        0.7    0        97.3   83.2     0      16.8
        3      2.7    0        0.8    0.1      94.4   77.9     1.9    22.0
        6      2.7    0        0.8    0.1      94.4   77.9     2.1    22.0
Δr20    1      6.6    0        0.8    0        13.7   0        78.8   100
        3      9.7    0.4      0.8    1.5      13.3   0        76.2   98.0
        6      9.7    0.4      0.8    1.6      13.3   0        76.2   98.0

Figure 8.1 shows the accumulated generalised impulse responses, and 
these confirm the conclusions drawn from the Granger causality tests and the 
variance decomposition calculations. 


8.9 Multivariate GARCH models 


Increasingly, attention is being focused on building multivariate GARCH 
(MGARCH) models to account for the observation that asset and market 
volatilities appear to be correlated over time. This has also been motivated 
by the fact that, in most asset pricing theories, returns depend on the 
covariance with some benchmark portfolio. In addition to capturing the 
univariate temporal dependencies in conditional variances, MGARCH 
models also specify how covariances vary through time. Comprehensive 
reviews of this expanding literature can be found in Bauwens, Laurent and 
Rombouts (2006), Brooks (2006) and Silvennoinen and Teräsvirta (2007).

Figure 8.1 Accumulated generalised impulse response functions. Panels: (a) accumulated response of Δp, (b) accumulated response of Δd, (c) accumulated response of Δrs and (d) accumulated response of Δr20, each to generalised one S.D. innovations in Δp, Δd, Δrs and Δr20.


The basic framework for MGARCH modelling was first introduced by 
Bollerslev, Engle and Wooldridge (1988), who extended the univariate 
GARCH model to include a vectorised conditional covariance matrix. 
Although the unrestricted model, known as the VECH-GARCH, is very 
general, it is practically infeasible to estimate for more than two variables 
because of the large number of parameters contained within it. Conse- 
quently, Bollerslev, Engle and Wooldridge, and others, have proposed a 
number of restrictions and specifications to reduce the number of par- 
ameters needed to be estimated. 

A popular modification is that of Bollerslev (1990), who assumes that 
the conditional correlation between the observed variables (or disturbances)
is constant through time. To ease exposition, we shall assume that the
observed return series have zero means and that the conditional variance
structure is limited to a single lag. Thus, if we have the vector of $k$ returns
$x_t$, then we allow the conditional variance of $x_t$ to vary through time
according to


$$V(x_t \mid I_{t-1}) = \Omega_t$$


where $I_{t-1}$ is the information set available at time $t - 1$. In the constant
correlation MGARCH(1, 1) model, the conditional variances, $\sigma_{i,t}^2$, and the
covariances, $\sigma_{ij,t}$, of $\Omega_t$ are given by


$$\sigma_{i,t}^2 = \omega_i + \alpha_i x_{i,t-1}^2 + \beta_i \sigma_{i,t-1}^2, \qquad i = 1, 2, \ldots, k$$

$$\sigma_{ij,t} = \rho_{ij}\sigma_{i,t}\sigma_{j,t}, \qquad 1 \le i, j \le k$$


where the $\rho_{ij}$ are the constant correlations. It must also be assumed that
$\omega_i, \alpha_i, \beta_i > 0$, that $\alpha_i + \beta_i < 1$ for all $i = 1, 2, \ldots, k$, and that the matrix of
correlations is positive definite. Although the conditional correlation is 
constant, the model allows for time variation in the conditional covariances. 
Estimation is somewhat simpler than the more general MGARCH specifi- 
cations since the sample correlation matrix of standardised residuals, which 
is by default positive definite, is the ML estimator of the correlation matrix. A 
convenient reduction in the number of parameters required to be estimated 
results from the fact that the correlation matrix can be concentrated out of 
the log-likelihood function. 
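To see how constant correlations still produce time-varying covariances, the recursions can be filtered through a data set for given parameter values. The sketch below is only illustrative: the parameters and the simulated returns are assumptions, the recursion is started at the unconditional variances, and no estimation is attempted.

```python
import numpy as np

def ccc_garch_filter(x, omega, alpha, beta, R):
    """Conditional variances and covariance matrices Omega_t of a constant
    correlation GARCH(1,1) model, for given parameters and zero-mean returns x."""
    T, k = x.shape
    var = np.empty((T, k))
    var[0] = omega / (1.0 - alpha - beta)        # start at the unconditional variances
    for t in range(1, T):
        var[t] = omega + alpha * x[t - 1] ** 2 + beta * var[t - 1]
    sd = np.sqrt(var)
    # Omega_t[i, j] = rho_ij * sigma_i,t * sigma_j,t
    Omega = sd[:, :, None] * R[None, :, :] * sd[:, None, :]
    return var, Omega

# Assumed inputs: two simulated return series and illustrative parameters.
rng = np.random.default_rng(3)
x = 0.01 * rng.standard_normal((300, 2))
omega = np.array([1e-5, 2e-5])
alpha = np.array([0.05, 0.10])
beta = np.array([0.90, 0.85])
R = np.array([[1.0, 0.7], [0.7, 1.0]])

var, Omega = ccc_garch_filter(x, omega, alpha, beta, R)
implied_corr = Omega[:, 0, 1] / np.sqrt(Omega[:, 0, 0] * Omega[:, 1, 1])
```

The implied correlation equals 0.7 at every date, while `Omega[:, 0, 1]` moves with the two conditional standard deviations. Positive definiteness of each `Omega[t]` follows from the positive variances and the positive definite correlation matrix.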

In a more general setting, Ling and McAleer (2003) study the asymptotic 
theory of vector ARMA-GARCH processes, which include the Bollerslev 
(1990) model as a special case. They establish conditions for strict statio- 
narity, ergodicity and higher-order moments for such models and prove the 
consistency and asymptotic normality of the QML estimator under certain 
conditions. Issues related to estimation and software packages that can be 
used for estimation are reviewed by Brooks, Persand and Burke (2003). To 
test the adequacy of the constant correlation MGARCH model, Bollerslev 
(1990) suggests using a portmanteau statistic on the cross-products of the 
standardised residuals across different equations, and auxiliary regressions 
that include products of standardised residuals. To relax the somewhat
restrictive assumption of time-invariant correlations a number of alternative
models have been proposed, but, typically, these can only be used at great
computational expense. For example, Tse (2000) proposes the following
specification for the conditional correlations and covariances:


$$\rho_{ij,t} = \rho_{ij} + \gamma_{ij} x_{i,t-1} x_{j,t-1}$$


$$\sigma_{ij,t} = \rho_{ij,t}\sigma_{i,t}\sigma_{j,t}, \qquad 1 \le i < j \le k$$


In this way the correlations are allowed to respond to the products of
previous observations. Since the $\gamma_{ij}$ must also be estimated, however, the
total number of parameters now becomes $k^2 + 2k$. Moreover,
computationally cumbersome restrictions must be imposed on the $\gamma_{ij}$ to ensure that
the conditional variance matrices remain positive definite. Since this
extended model contains the previous one as a special case, Tse (2000)
develops an LM test for the null hypothesis of constant correlation, which
requires that $\gamma_{ij} = 0$, $1 \le i < j \le k$.
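The parameter count can be checked by simple bookkeeping, assuming one $(\omega_i, \alpha_i, \beta_i)$ triple per series and one correlation parameter per distinct pair of series:

```latex
\begin{align*}
\text{constant correlation model:}\quad
  & \underbrace{3k}_{\omega_i,\ \alpha_i,\ \beta_i}
    + \underbrace{\tfrac{1}{2}k(k-1)}_{\rho_{ij}}
    = \tfrac{1}{2}\left(k^2 + 5k\right) \\
\text{time-varying correlation model:}\quad
  & 3k + \underbrace{\tfrac{1}{2}k(k-1)}_{\rho_{ij}}
    + \underbrace{\tfrac{1}{2}k(k-1)}_{\gamma_{ij}}
    = k^2 + 2k
\end{align*}
```

For a bivariate system ($k = 2$) this gives 7 and 8 parameters respectively, so the LM test of constant correlation involves a single restriction in that case.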

MGARCH models have found significant applications in both industry 
and academia. For example, they have been used to study possible co- 
movements and spillovers between volatilities in different assets and mar- 
kets, to obtain time-varying asset pricing models and hedge ratios, and to 
model more effectively the risk in portfolios of assets. Bauwens, Laurent and 
Rombouts (2006) provides references to key applications in this literature. 


Example 8.9 MGARCH estimation of FTSE optimal hedge ratios 
There has been a good deal of interest in the use of futures contracts as a
hedging instrument against possible adverse movements in financial mar- 
kets. The number of futures contracts needed to hedge a cash market pos- 
ition is calculated using the hedge ratio. Despite almost three decades of 
research, the formulation and implementation of an optimal hedge strategy — 
or, more precisely, the appropriate way to calculate the optimal hedge ratio — 
remains a controversial issue in finance and econometrics (see, for example, 
Brooks and Chong, 2001). The earliest and probably simplest approach
estimates a static hedge ratio from the slope coefficient obtained by an OLS
regression of spot returns, $s_t$, on futures returns, $f_t$. This static optimal hedge
ratio can be calculated simply as $\beta = \sigma_{sf}/\sigma_f^2$, where $\sigma_{sf}$ is the covariance
between spot and futures price returns and $\sigma_f^2$ is the variance of futures price
returns.
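The equivalence between the covariance ratio and the OLS slope is easily confirmed on artificial data. The returns below are simulated under assumed parameter values and are not the FTSE series; the regression here includes a constant, in which case the two calculations agree exactly.

```python
import numpy as np

# Simulated daily returns (assumption): spot returns track futures returns closely.
rng = np.random.default_rng(1)
f = 0.012 * rng.standard_normal(253)
s = 0.99 * f + 0.002 * rng.standard_normal(253)

# Hedge ratio as the ratio of the sample covariance to the futures variance.
cov = np.cov(s, f)                       # 2 x 2 sample covariance matrix
beta_cov = cov[0, 1] / cov[1, 1]

# The same number as the slope of an OLS regression of s on a constant and f.
X = np.column_stack([np.ones_like(f), f])
beta_ols = np.linalg.lstsq(X, s, rcond=None)[0][1]

# Variance of the unhedged, naively hedged (beta = 1) and optimally hedged positions.
var_unhedged = s.var(ddof=1)
var_naive = (s - f).var(ddof=1)
var_hedged = (s - beta_cov * f).var(ddof=1)
```

By construction `var_hedged` cannot exceed `var_naive` in the sample used to compute the ratio, since the covariance ratio minimises the sample variance of $s_t - \beta f_t$ over $\beta$.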

Since returns are known to exhibit time-varying covariances and vari- 
ances, several alternative estimation methods have been developed to model 
dynamic hedge ratios. In this case, at time $t - 1$, the expected return and
variance of a portfolio consisting of one unit of the asset hedged by a short
position of $\beta_{t-1}$ units of the futures contract will be $E(s_t) - \beta_{t-1}E(f_t)$ and
$\sigma_{s,t}^2 + \beta_{t-1}^2\sigma_{f,t}^2 - 2\beta_{t-1}\sigma_{sf,t}$ respectively. The conditional (and
hence time-varying) variances of spot and futures returns are denoted here by
$\sigma_{s,t}^2$ and $\sigma_{f,t}^2$, while the conditional covariance is $\sigma_{sf,t}$. The dynamic
optimal hedge ratio that minimises the variance of the spot and futures
portfolio returns will then be $\beta_{t-1} = \sigma_{sf,t}/\sigma_{f,t}^2$.
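The minimisation behind the dynamic hedge ratio is a one-line calculus exercise. The derivation below assumes the sign convention in which the hedged position is long one unit of the spot asset and short $\beta_{t-1}$ units of the futures contract, so that its return is $s_t - \beta_{t-1} f_t$:

```latex
\begin{align*}
V_{t-1}(s_t - \beta_{t-1} f_t)
  &= \sigma_{s,t}^2 + \beta_{t-1}^2 \sigma_{f,t}^2 - 2\beta_{t-1}\sigma_{sf,t} \\
\frac{\partial V_{t-1}}{\partial \beta_{t-1}}
  &= 2\beta_{t-1}\sigma_{f,t}^2 - 2\sigma_{sf,t} = 0
  \quad\Longrightarrow\quad
  \beta_{t-1} = \frac{\sigma_{sf,t}}{\sigma_{f,t}^2} \\
\frac{\partial^2 V_{t-1}}{\partial \beta_{t-1}^2}
  &= 2\sigma_{f,t}^2 > 0
\end{align*}
```

The positive second derivative confirms a minimum. If $\beta_{t-1}$ is instead defined as a long futures position, the same calculation gives the ratio with the opposite sign, so the two conventions describe the same hedge.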

To demonstrate the estimation of the optimal hedge ratio, we employ 
daily spot prices and respective futures contracts prices for the FTSE for the 
period 2 January 2003 to 31 December 2003. After removing non-trading 
days and matching trading dates between spot and futures prices, 253 
observations were available. There are four delivery months for the FTSE 
futures contract: March, June, September and December. Due to the size of 
the market, at least two contracts were traded at any time, which facilitates 
contract rollover. 

Using OLS, the static optimal hedge ratio regression was estimated 
between logarithmic returns of spot and futures prices to be 


$$s_t = \underset{(0.011)}{0.987}\, f_t$$


with the Newey—West standard error being shown in parentheses. The 
constant was omitted since it was found to be statistically insignificant. As 
expected, tests for heteroskedasticity in the residuals suggest that the 
regression is misspecified. For example, an LM test for ARCH(1) produced 
an F-statistic of 24.1, which is highly significant. 

Using the relevant program provided in EViews version 5.0, we imple- 
mented the restricted version of the bivariate BEKK-MGARCH model 
proposed by Engle and Kroner (1995) and named after an unpublished paper 
by Baba, Engle, Kraft and Kroner. This modelling approach resolves the 
problem of ensuring the positive definiteness of the covariance matrix by 
using a quadratic form for the conditional covariance equations. The
estimated models for the conditional variances of spot and futures returns and
the conditional covariance are


$$\sigma_{s,t}^2 = \underset{(0.0006)}{0.0021^2} + \underset{(0.020)}{0.922^2}\,\sigma_{s,t-1}^2 + \underset{(0.043)}{0.331^2}\,\varepsilon_{s,t-1}^2$$


$$\sigma_{f,t}^2 = \underset{(0.0001)}{0.0005^2} + \underset{(0.022)}{0.916^2}\,\sigma_{f,t-1}^2 + \underset{(0.047)}{0.343^2}\,\varepsilon_{f,t-1}^2$$


$$\sigma_{sf,t} = \underset{(0.0006)}{0.0021} \cdot \underset{(0.0001)}{0.0005} + \underset{(0.020)}{0.922} \cdot \underset{(0.022)}{0.916}\,\sigma_{sf,t-1} + \underset{(0.043)}{0.331} \cdot \underset{(0.047)}{0.343}\,\varepsilon_{s,t-1}\varepsilon_{f,t-1}$$


where $\varepsilon_{s,t}$ and $\varepsilon_{f,t}$ are the residuals from the conditional mean equation for
the spot and futures returns, respectively. An LR test clearly supports the
MGARCH specification, as can also be seen from the highly significant
parameter estimates in the conditional variance and covariance equations.

Figure 8.2 Estimated dynamic hedge ratio for FTSE futures contracts during 2003
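Given residuals from the mean equations, the three recursions can be run jointly to trace out the conditional covariance matrix and the implied hedge ratio. The sketch below uses the coefficient values reported above, but the residuals are simulated and the starting value is their sample covariance matrix; both are assumptions made for illustration, so the output does not reproduce the FTSE results.

```python
import numpy as np

# Reported coefficients, ordered (spot, futures).
C = np.array([0.0021, 0.0005])    # constants
B = np.array([0.922, 0.916])      # loadings on lagged variances and covariance
A = np.array([0.331, 0.343])      # loadings on lagged residual products

def bekk_filter(eps, H0):
    """Conditional covariance matrices H_t for the diagonal quadratic-form recursion
    H_t = cc' + diag(B) H_{t-1} diag(B) + diag(A) e_{t-1} e_{t-1}' diag(A)."""
    T = eps.shape[0]
    H = np.empty((T, 2, 2))
    H[0] = H0
    for t in range(1, T):
        e = eps[t - 1]
        H[t] = np.outer(C, C) + np.outer(B, B) * H[t - 1] + np.outer(A, A) * np.outer(e, e)
    return H

# Simulated residuals (assumption): highly correlated spot and futures shocks.
rng = np.random.default_rng(7)
cov_true = 0.012 ** 2 * np.array([[1.0, 0.98], [0.98, 1.0]])
eps = rng.multivariate_normal(np.zeros(2), cov_true, size=253)

H = bekk_filter(eps, H0=np.cov(eps.T))
hedge_ratio = H[:, 0, 1] / H[:, 1, 1]   # sigma_sf,t / sigma_f,t^2
```

Each term of the recursion is a quadratic form, so every `H[t]` stays positive definite provided the starting matrix is, which is the property that motivates this parameterisation.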

The dynamic hedge ratio obtained by this approach is shown in figure 8.2.
Although it has an average value of 0.985, very close to the static hedge ratio
estimated via OLS, it varies throughout the sample between 0.889 and 1.087.
In accordance with many empirical studies, however, we find that, despite
OLS being misspecified, it nevertheless produces a smaller variance of
portfolio returns. More specifically, we find that the static approach
produces a variance of portfolio returns ($4.70 \times 10^{-6}$) that is slightly smaller
than that of the dynamic ($5.07 \times 10^{-6}$). Moreover, even a naive static
hedging approach of purchasing one futures contract ($\beta = 1$) produces a
smaller variance for the hedged position returns ($4.72 \times 10^{-6}$). The
usefulness of hedging is clearly demonstrated, however, by the unhedged position
($\beta = 0$) having a much larger variance of $1.50 \times 10^{-4}$.


Chapter 8 has developed regression techniques for modelling relationships
between non-integrated time series. As we have seen in earlier chapters, 
however, many financial time series are integrated, often able to be charac- 
terised as I(1) processes, and the question thus arises as to whether the 
presence of integrated variables affects our standard regression results and 
conventional procedures of inference. This question is long-standing, since it 
has been known since Yule (1897) that an unremoved deterministic time 
trend could produce erroneous regression results by acting as a common but 
non-causal influence behind otherwise independent time series. Later, Yule 
(1926) was the first to explore directly the problem of ‘nonsense correl- 
ations’, arguing that these resulted from violations of the assumptions 
behind linear correlation, in particular that of serial independence. Through 
analytical examples, Yule showed that estimated correlations can be sig- 
nificantly biased if the underlying variables are polynomials of time. He also 
performed a set of impressive hand-calculated Monte Carlo experiments that 
demonstrated that nonsense correlations could also arise when analysing the 
relationships between pairs of I(1) or I(2) variables. Soon afterwards, Slutsky 
(1937) and Working (1934) were able to argue that random walk processes 
could produce conspicuous, yet erroneous, cyclical behaviour. Indeed, 
Working (1934, p. 11) expressed a view that, unfortunately, was ignored for 
many years: ‘Economic theory has fallen far short of recognising the full 
implications of the resemblance of many economic time series to random- 
difference series; and methods of statistical analysis in general use have given 
these implications virtually no recognition.’ Only much later, beginning in 
the early 1970s, were these concerns addressed seriously by modern time 
series econometricians. 

The (re)starting point of this research programme can be traced to the 
paper by Granger and Newbold (1974) on the possibility of ‘spurious 
regressions’, which then led naturally to the concept of cointegration (Engle 
and Granger, 1987). This attempts to capture stochastically the idea of long- 
run equilibrium between a set of non-stationary variables. The intuition 
underlying cointegration was certainly not new, being related to the error 
correction mechanisms introduced by Phillips (1957) and given empirical 
content by Sargan (1964). Theoretical arguments for the existence of coin- 
tegrating relationships include market segmentation, non-instantaneous 
market clearing and the behaviour of steady-state solutions to intertemporal 
optimisation problems (see Pesaran, 1997). As noted by Markellos and Mills 
(1998), financial practitioners have long been aware, since at least the early 
1900s, that certain trending financial time series have some form of long-run 
relationship, although they also realised that scaling transformations were 
required to be able to express these regularities as interpretable financial 
ratios that deviate around some average (equilibrium) value. Broadly 
speaking, in this context cointegration analysis can be viewed as a way of 
formalising financial ratio analysis and forecasting. 

Section 9.1 thus analyses spurious regressions between integrated time 
series. This leads naturally on to the concept of cointegration, which is 
introduced in section 9.2. Testing for cointegration in regression models is 
discussed in section 9.3, and the estimation of cointegrating regressions is the 
subject material of section 9.4. Section 9.5 considers VARs containing 
integrated and, possibly, cointegrated variables, which enables us to develop 
the vector error correction model (VECM) framework. Causality testing in 
VECMs is discussed in section 9.6, and impulse response functions are 
analysed within a VECM framework in section 9.7. Section 9.8 focuses on the 
case when there is just a single long-run (cointegrating) relationship, which 
enables a simpler analysis to take place, and, finally, section 9.9 sets out a 
framework for modelling common trends and cycles. 


9.1 Spurious regression 


We begin by considering the simulation example analysed by Granger and 
Newbold (1974) in an important article examining some of the likely 
empirical consequences of nonsense, or spurious, regressions in 
econometrics. They consider a situation in which $y_t$ and $x_t$ are 
generated by the independent random walks 

\[ y_t = y_{t-1} + v_t, \qquad x_t = x_{t-1} + w_t, \qquad t = 1, 2, \ldots \]

where $v_t$ and $w_t$ are independent white noises. The regression of $y_t$ 
on a constant and $x_t$ is then fitted: 

\[ y_t = \hat{\alpha}_T + \hat{\beta}_T x_t + \hat{u}_t, \qquad t = 1, 2, \ldots, T \tag{9.1} \]


With $T = 50$, $y_0 = x_0 = 100$ and $v_t$ and $w_t$ drawn from independent 
$N(0, 1)$ distributions, Granger and Newbold report a rejection rate of 
76 per cent when testing the (correct) null hypothesis that $\beta = 0$ in the 
regression (9.1) using the conventional t-statistic for assessing the 
significance of $\hat{\beta}_T$. 
Moreover, when five independent random walks are included as regressors 
in a multiple regression, the rejection rate of a conventional F-statistic testing 
that the coefficient vector is zero rises to 96 per cent. For regressions 
involving independent ARIMA(0,1,1) series the corresponding rejection 
rates are 64 per cent and 90 per cent, respectively, and Granger and Newbold 
thus conclude that conventional significance tests are seriously biased 
towards rejection of the null hypothesis of no relationship, and hence 
towards acceptance of a spurious relationship, when the series are generated 
as statistically independent integrated processes. 
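
The experiment described above can be sketched in a few lines of Python. This is only an illustrative reconstruction under stated assumptions, not Granger and Newbold's original code: it assumes NumPy is available, uses a two-sided critical value of 1.96, and the function names `ols_slope_stats` and `spurious_rejection_rate`, the seed and the number of replications are all hypothetical choices.

```python
import numpy as np

def ols_slope_stats(y, x):
    """OLS of y on a constant and x; returns (slope, t-ratio, R2, dw)."""
    T = len(y)
    X = np.column_stack([np.ones(T), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    s2 = resid @ resid / (T - 2)                    # residual variance
    cov = s2 * np.linalg.inv(X.T @ X)               # conventional OLS covariance
    t_ratio = coef[1] / np.sqrt(cov[1, 1])
    r2 = 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)
    dw = np.sum(np.diff(resid) ** 2) / (resid @ resid)
    return coef[1], t_ratio, r2, dw

def spurious_rejection_rate(T=50, reps=2000, seed=0):
    """Share of replications in which |t| > 1.96 although beta = 0 is true."""
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(reps):
        y = 100 + np.cumsum(rng.standard_normal(T))  # independent random walks
        x = 100 + np.cumsum(rng.standard_normal(T))
        _, t_ratio, _, _ = ols_slope_stats(y, x)
        rejections += abs(t_ratio) > 1.96
    return rejections / reps

rate = spurious_rejection_rate()
print(f"rejection rate of the true null: {rate:.2f}")
```

The exact rate depends on the seed and the critical value used, but it should lie far above the nominal 5 per cent level.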

Moreover, such regression results are frequently accompanied by large $R^2$ 
values and highly autocorrelated residuals, as indicated by very low 
Durbin-Watson (dw) statistics. These findings led Granger and Newbold (1974) to 
suggest that, in the joint circumstances of a high $R^2$ and a low dw statistic (a 
useful rule being $R^2 > dw$), regressions should be run on the first differences 
of the variables. Further empirical evidence in favour of first-differencing in 
regression models was provided by Granger and Newbold (1986, pp. 205-15) 
and Plosser and Schwert (1978). 

These essentially empirical conclusions have since been given an analytical 
foundation by Phillips (1986), who makes much weaker assumptions about 
the innovations $\xi_t = (v_t, w_t)'$ than those made above. In fact, Phillips 
assumes that $\xi_t$ follows a multivariate version of the conditions (3.10) used to 
develop non-parametric unit root tests, and which were also employed in the 
(stationary) regression framework of chapter 8, section 1, i.e. 

\[ E(\xi_t) = 0 \ \text{for all } t \tag{9.2a} \]

\[ \sup_{i,t} E\left(|\xi_{it}|^{\beta}\right) < \infty \ \text{for some } \beta > 2, \quad i = 1, 2 \quad (\xi_{1t} = v_t,\ \xi_{2t} = w_t) \tag{9.2b} \]

\[ \Sigma_S = \lim_{T \to \infty} T^{-1} E\left(S_T S_T'\right) \ \text{exists and is positive definite, where } S_T = \sum_{t=1}^{T} \xi_t \tag{9.2c} \]

\[ \xi_t \ \text{is strong mixing} \tag{9.2d} \]


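In the special case when $v_t$ and $w_t$ are independent, the ‘long-run’ covariance 
matrix $\Sigma_S$ is 

\[ \Sigma_S = \begin{bmatrix} \sigma_v^2 & 0 \\ 0 & \sigma_w^2 \end{bmatrix} \]

where 

\[ \sigma_v^2 = \lim_{T \to \infty} T^{-1} E(P_T^2), \qquad \sigma_w^2 = \lim_{T \to \infty} T^{-1} E(Q_T^2) \]

and 

\[ P_t = \sum_{j=1}^{t} v_j, \qquad Q_t = \sum_{j=1}^{t} w_j, \qquad P_0 = Q_0 = 0 \]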


Phillips (1986) shows that, under these conditions, suitably standardised 
sample moments of the sequences $\{y_t\}_1^{\infty}$ and $\{x_t\}_1^{\infty}$ weakly converge to 
appropriately defined functionals of Brownian motion, rather than to 
constants as in the non-integrated regressor case discussed in chapter 8, which 
assumes that $y_t$ and $x_t$ are, for example, ergodic. As a consequence, the 
standard distributional results of least squares regression break down, since 
they are based on the ratios of sample moments converging to constants. 
While not providing too great a level of rigour, a sketch of the derivation of 
this crucial result is nonetheless illuminating. We begin by noting that we 
may write $y_t = P_t + y_0$ and $x_t = Q_t + x_0$, where the initial conditions $y_0$ and $x_0$ 
can either be constants or can have certain specified distributions, from 
which we construct the standardised sums (recall the development in 
chapter 3, section 2.1) 

\[ Y_T(r) = T^{-1/2} \sigma_v^{-1} P_{[Tr]} = T^{-1/2} \sigma_v^{-1} P_{j-1} \]
\[ X_T(r) = T^{-1/2} \sigma_w^{-1} Q_{[Tr]} = T^{-1/2} \sigma_w^{-1} Q_{j-1} \]
\[ (j-1)/T \le r < j/T, \qquad j = 1, \ldots, T \]


Using the more general partial-sum process $S_t$, we can also construct 

\[ Z_T(r) = T^{-1/2} \Sigma_S^{-1/2} S_{[Tr]} = T^{-1/2} \Sigma_S^{-1/2} S_{j-1} \]

where $\Sigma_S^{1/2}$ is the positive definite square root of $\Sigma_S$. Phillips (1987c) 
proves that, as $T \to \infty$, $Z_T(r)$ converges weakly to the vector Brownian 
motion $Z(r)$, i.e. 

\[ Z_T(r) \Rightarrow Z(r) \]

From the properties of Brownian motion, $Z(r)$ is multivariate normal, with 
independent increments (so that $Z(s)$ is independent of $Z(r) - Z(s)$ for 
$0 < s < r \le 1$) and with independent elements (so that the $i$th element $Z_i(r)$ is 
independent of the $j$th element $Z_j(r)$, $i \ne j$). 
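When the sequences $v_t$ and $w_t$ are independent, 

\[ Z(r) = \begin{bmatrix} V(r) \\ W(r) \end{bmatrix}, \qquad Z_T(r) = \begin{bmatrix} Y_T(r) \\ X_T(r) \end{bmatrix} \]

and hence 

\[ Y_T(r) \Rightarrow V(r), \qquad X_T(r) \Rightarrow W(r) \]

as $T \to \infty$, where $V(r)$ and $W(r)$ are independent Brownian motions. Phillips 
(1986) then proves the following results: 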


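\[ \text{(i)} \quad \hat{\beta}_T \Rightarrow \sigma_v \sigma_w^{-1} \frac{\int_0^1 V(r)W(r)\,dr - \int_0^1 V(r)\,dr \int_0^1 W(r)\,dr}{\int_0^1 W(r)^2\,dr - \left(\int_0^1 W(r)\,dr\right)^2} = \sigma_v \sigma_w^{-1} \frac{S_{VW}}{S_{WW}} \]

\[ \text{(ii)} \quad T^{-1/2} t_{\hat{\beta}_T} \Rightarrow \frac{S_{VW}}{\left(S_{VV} S_{WW} - S_{VW}^2\right)^{1/2}} \]

\[ \text{(iii)} \quad R^2 \Rightarrow \frac{S_{VW}^2}{S_{VV} S_{WW}} \]

\[ \text{(iv)} \quad dw \xrightarrow{\,p\,} 0 \]

where we use the notation $S_{ab} = \int_0^1 a(r)b(r)\,dr - \int_0^1 a(r)\,dr \int_0^1 b(r)\,dr$. 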
As Phillips (1986) remarks, these analytical results go a long way towards 
explaining the Monte Carlo findings reported by Granger and Newbold 
(1974). Result (i) shows that, in contrast to the usual results of regression 
theory, $\hat{\beta}_T$ and, similarly, $\hat{\alpha}_T$, do not converge in probability to constants as 
$T \to \infty$. $\hat{\beta}_T$ has a non-degenerate limiting distribution, so that different 
arbitrarily large samples will yield randomly differing estimates of $\beta$. The 
distribution of $\hat{\alpha}_T$ (not shown) actually diverges, so that estimates are likely 
to get further and further away from the true value of zero as the sample size 
increases. Thus, the uncertainty about the regression (9.1) stemming from its 
spurious nature persists asymptotically in these limiting distributions, being 
a consequence of the sample moments of $y_t$ and $x_t$ (and their joint sample 
moments) not converging to constants but, upon appropriate 
standardisation, converging weakly to random variables. 

Result (ii) shows that the conventional t-ratio on $\hat{\beta}_T$ (and similarly for 
$\hat{\alpha}_T$) does not have a t-distribution, and indeed does not have any limiting 
distribution, diverging as $T \to \infty$ so that there are no asymptotically correct 
critical values for these tests. We should thus expect the rejection rate when these 
tests are based on a critical value delivered from conventional asymptotics 
(such as 1.96) to continue to increase with sample size, and this is consistent 
with the findings of Granger and Newbold. 

Results (iii) and (iv) show that $R^2$ has a non-degenerate limiting 
distribution and that dw converges in probability to zero as $T \to \infty$. Low values 
for dw and moderate values of $R^2$ are therefore to be expected in spurious 
regressions such as (9.1) with data generated by integrated processes, again 
confirming the simulation findings reported by Granger and Newbold. 

These results are easily extended to multiple regressions of the form 


\[ y_t = \hat{\alpha}_T + \hat{\beta}_T' \mathbf{x}_t + \hat{u}_t \tag{9.3} \]


where $\mathbf{x}_t = (x_{1t}, \ldots, x_{kt})'$ is a vector of I(1) processes. Phillips (1986) shows 
that analogous results to (i) to (iv) above hold for (9.3) and, in particular, 
that the distribution of the customary F-statistic for testing a set of linear 
restrictions on $\beta$ diverges as $T \to \infty$, so that there are no asymptotically 
correct critical values for this statistic either. Moreover, the divergence rate 
for the F-statistic is greater than that for individual t-tests, so in a regression 
with many regressors, therefore, we might expect a noticeably greater 
rejection rate for a ‘block’ F-test than for individual t-tests or for a test with 
fewer regressors, and this is again consistent with the results reported by 
Granger and Newbold. 
We should emphasise that, although the derivation of the asymptotic 
results has assumed the independence of $y_t$ and $x_t$, so that the true values of $\alpha$ 
and $\beta$ are zero, this is not crucial to the major conclusions. Although the 
correlation properties of the time series do have quantitative effects on 
the limiting distributions, these being introduced via the parameters of the 
limiting covariance matrix $\Sigma_S$ in the bivariate regression analysed in detail 
above, such effects do not interfere with the main qualitative results: that $\hat{\alpha}_T$ 
and $\hat{\beta}_T$ do not converge in probability to constants, that the distributions of 
F- and t-statistics diverge as $T \to \infty$, and that dw converges in probability to 
zero whereas $R^2$ has a non-degenerate limiting distribution as $T \to \infty$. 
Hamilton (1994, chap. 19.3) provides a detailed treatment of the spurious 
multiple regression model. 

A Monte Carlo simulation similar to that of Granger and Newbold (1974) 
enables us to interpret these results in a perhaps more transparent fashion. 
The independent random walks $y_t$ and $x_t$ were generated for a sample now of 
size $T = 1000$, $v_t$ and $w_t$ were again drawn from independent $N(0,1)$ 
populations and $y_0 = x_0 = 0$, using 10,000 iterations. Figures 9.1 to 9.4 present the 
density functions of $\hat{\beta}_{1000}$, its associated t-ratio, and the $R^2$ and dw statistics. 
The distribution of $\hat{\beta}_{1000}$ is almost normal (a central limit 
theorem does, in fact, hold as the simulations use independent replications). 
Although the sample mean is $-0.0052$, the sample standard deviation is 
0.635, confirming that, for large $T$, the distribution does not converge to a 
constant and different samples produce very different estimates of $\beta$, the 
range of estimates being approximately $\pm 3.0$. 

The distribution of the t-ratio, shown in figure 9.2, is again normal but 
with a standard deviation of 23.62. The 5 per cent critical values from this 
distribution are $\pm 48.30$, while using $\pm 1.96$ would entail a rejection rate of 
93.4 per cent. The distribution of the $R^2$ statistic has a mean of 0.24, a 
standard deviation of 0.23 and a maximum value of 0.94, while that for the 
dw statistic has a mean of 0.018, a standard deviation of 0.011 and a 
maximum value of only 0.10. (Note that the smoothing involved in constructing 
the density functions leads to negative values in the left-hand tails of these 
two distributions; the actual minimum sample values of $R^2$ and dw are, of 
course, positive, although extremely small, being 0.0008 for dw and of the 
order of $10^{-10}$ for $R^2$.) Both sampling distributions thus illustrate the 
theoretical predictions of Phillips’ (1986) analysis. 
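
A small-scale version of such a simulation can show how the statistics behave as the sample grows. The sketch below is an assumption-laden illustration rather than the authors' code: it uses NumPy, only 500 replications, sample sizes of 100 and 400, and a hypothetical helper `spurious_stats`. If results (ii) and (iv) hold, quadrupling $T$ should roughly double the spread of the t-ratio and cut the average dw to about a quarter.

```python
import numpy as np

def spurious_stats(T, reps, rng):
    """t-ratios and dw statistics from regressing one random walk on another."""
    t_ratios, dws = np.empty(reps), np.empty(reps)
    for i in range(reps):
        y = np.cumsum(rng.standard_normal(T))
        x = np.cumsum(rng.standard_normal(T))
        X = np.column_stack([np.ones(T), x])
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        e = y - X @ coef
        se = np.sqrt(e @ e / (T - 2) * np.linalg.inv(X.T @ X)[1, 1])
        t_ratios[i] = coef[1] / se
        dws[i] = np.sum(np.diff(e) ** 2) / (e @ e)
    return t_ratios, dws

rng = np.random.default_rng(1)
results = {T: spurious_stats(T, 500, rng) for T in (100, 400)}
t_sd = {T: results[T][0].std() for T in results}       # spread of the t-ratio
dw_mean = {T: results[T][1].mean() for T in results}   # average dw statistic
for T in results:
    print(T, round(t_sd[T], 2), round(dw_mean[T], 4))
```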

It should be emphasised that, in the general set-up discussed here, where 
both $y_t$ and $x_t$ are I(1) processes, the error, $u_t$, since it is by definition a linear 
combination of I(1) processes, will also be integrated, unless a special 
restriction (to be discussed subsequently) holds. Moreover, the usual 
respecification of the model to include $y_{t-1}$ as an additional regressor on the 
finding of a very low dw value will have pronounced consequences: the 
estimated coefficient on $y_{t-1}$ will converge to unity, while that on the 
integrated regressor(s) will converge to zero, thus highlighting the spurious 
nature of the static regression. 

Indeed, the spurious nature of the regression is, in fact, a consequence of 
the error being I(1). Achieving a stationary, or I(0), error is usually a 
minimum criterion to meet in econometric modelling, for much of the focus of 
recent developments in the construction of dynamic regression models has 
been to ensure that the error is not only I(0) but white noise. Whether the 
error in a regression between integrated variables is stationary is therefore a 
matter of considerable importance. 


9.2 Cointegrated processes 


As just noted, a linear combination of I(1) processes will usually also be I(1). 
In general, if $y_t$ and $x_t$ are both I($d$), then the linear combination 

\[ u_t = y_t - a x_t \tag{9.4} \]

will usually be I($d$). It is possible, however, that $u_t$ may be integrated of a 
lower order, say I($d - b$), where $b > 0$, in which case a special constraint 
operates on the long-run components of the two series. If $d = b = 1$, so that $y_t$ 
and $x_t$ are both I(1) and dominated by ‘long-wave’ components, $u_t$ will be 
I(0), and hence will not have such components; $y_t$ and $a x_t$ must therefore 
have long-run components that cancel out to produce $u_t$. In such 
circumstances, $y_t$ and $x_t$ are said to be cointegrated; we emphasise that it will not 
generally be true that there will exist such an $a$ that makes $u_t \sim$ I(0), or, in 
general, I($d - b$). 

The idea of cointegration can be related to the concept of long-run 
equilibrium, which can be illustrated by the bivariate relationship 


\[ y_t = a x_t \]

or 

\[ y_t - a x_t = 0 \]

Thus, $u_t$ given by (9.4) measures the extent to which the ‘system’ is out of 
equilibrium, and it can therefore be termed the ‘equilibrium error’. 
Assuming that $d = b = 1$, so that $y_t$ and $x_t$ are both I(1), the equilibrium error 
will then be I(0) and $u_t$ will rarely drift far from zero, and will often cross 
the zero line. In other words, equilibrium will occasionally occur, at least to 
a close approximation, whereas if $y_t$ and $x_t$ are not cointegrated, so that 
$u_t \sim$ I(1), the equilibrium error will wander widely and zero crossings would 
be very rare, suggesting that under these circumstances the concept of 
equilibrium has no practical implications. 
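
The contrast between an I(0) and an I(1) equilibrium error can be made concrete by counting zero crossings in simulated data. The following is a minimal sketch under assumptions of my own choosing: NumPy, a coefficient $a = 2$, a white-noise equilibrium error in the cointegrated case, and a hypothetical helper `zero_crossings`.

```python
import numpy as np

def zero_crossings(u):
    """Count the sign changes in the sequence u (exact zeros are ignored)."""
    signs = np.sign(u)
    signs = signs[signs != 0]
    return int(np.sum(signs[1:] != signs[:-1]))

rng = np.random.default_rng(2)
T, a = 1000, 2.0
x = np.cumsum(rng.standard_normal(T))         # I(1) series
y_coint = a * x + rng.standard_normal(T)      # cointegrated with x: u_t is I(0)
y_indep = np.cumsum(rng.standard_normal(T))   # unrelated random walk: u_t is I(1)

u_coint = y_coint - a * x
u_indep = y_indep - a * x
crossings = zero_crossings(u_coint), zero_crossings(u_indep)
print("zero crossings (cointegrated, not cointegrated):", crossings)
```

With a white-noise error roughly half of the observations involve a sign change, whereas the I(1) error crosses zero only a handful of times.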

How is the concept of cointegration linked to the analysis of spurious 
regressions? Condition (9.2c) on the innovation sequence $\xi_t$ requires that the 
limiting covariance matrix $\Sigma_S$ be non-singular. If we allow $\Sigma_S$ to be singular, 
the asymptotic theory yielding the results (i) to (iv) no longer holds. In 
general, we have 

\[ \Sigma_S = \begin{bmatrix} \sigma_v^2 & \sigma_{vw} \\ \sigma_{vw} & \sigma_w^2 \end{bmatrix} \]

so that, for $\Sigma_S$ to be singular, we require $|\Sigma_S| = \sigma_v^2\sigma_w^2 - \sigma_{vw}^2 = 0$. This 
implies that $\Sigma_S\gamma = 0$, where $\gamma' = (1, -a)$ and $a = \sigma_{vw}/\sigma_w^2$. Singularity of $\Sigma_S$ 
is a necessary condition for $y_t$ and $x_t$ to be cointegrated (Phillips, 1986; 
Phillips and Ouliaris, 1990), since in this case $|\Sigma_S| = 0$ implies that the 
‘long-run’ correlation between the innovations $v_t$ and $w_t$, given by 
$\rho_{vw} = \sigma_{vw}/\sigma_v\sigma_w$, is unity. For values of $\rho_{vw}$ less than unity, $y_t$ and $x_t$ are 
not cointegrated, and when $\rho_{vw} = 0$, so that $v_t$ and $w_t$ are independent, we have 
Granger and Newbold’s (1974) spurious regression. 
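
The step from a zero determinant to the vector that annihilates the long-run covariance matrix can be checked directly. The lines below are a worked verification rather than part of the original argument, writing $\Sigma_S$ for the long-run covariance matrix of $(v_t, w_t)$ with elements $\sigma_v^2$, $\sigma_{vw}$ and $\sigma_w^2$, and taking $\gamma' = (1, -a)$ with $a = \sigma_{vw}/\sigma_w^2$.

```latex
\[
\Sigma_S \gamma
  = \begin{bmatrix} \sigma_v^2 & \sigma_{vw} \\ \sigma_{vw} & \sigma_w^2 \end{bmatrix}
    \begin{bmatrix} 1 \\ -\sigma_{vw}/\sigma_w^2 \end{bmatrix}
  = \begin{bmatrix} \sigma_v^2 - \sigma_{vw}^2/\sigma_w^2 \\ \sigma_{vw} - \sigma_{vw} \end{bmatrix}
  = \begin{bmatrix} |\Sigma_S|/\sigma_w^2 \\ 0 \end{bmatrix}
\]
\[
|\Sigma_S| = 0
  \;\Longrightarrow\; \Sigma_S \gamma = 0
  \quad\text{and}\quad
  \rho_{vw}^2 = \frac{\sigma_{vw}^2}{\sigma_v^2 \sigma_w^2} = 1
\]
```

The first element of $\Sigma_S\gamma$ vanishes exactly when the determinant does, which is the same condition as a unit long-run correlation.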

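What differences to the asymptotic regression theory for integrated 
regressors result when $y_t$ is cointegrated with $x_t$? Since the equilibrium error 
$u_t$ can be regarded as the error term in the regression of $y_t$ on $x_t$, we may 
consider first the model 

\[ y_t = \beta x_t + u_t \tag{9.5a} \]

where 

\[ x_t = \pi + x_{t-1} + w_t \tag{9.5b} \]

and where $u_t$ and $w_t$ are contemporaneously correlated white noise, i.e. 
$E(u_t w_t) = \sigma_{uw}$. The OLS estimator of $\beta$ is 

\[ \hat{\beta}_T = \frac{\sum_{t=1}^{T} x_t y_t}{\sum_{t=1}^{T} x_t^2} = \beta + \frac{\sum_{t=1}^{T} x_t u_t}{\sum_{t=1}^{T} x_t^2} \]
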
Now, if $\pi = 0$, then, since $x_t \sim$ I(1), $\sum x_t^2$ needs to be scaled by $T^{-2}$ for it 
to converge to a finite value, whereas $\sum x_t u_t$ just requires scaling by $T^{-1}$ 
for it to converge to a finite value (see chapter 3, section 2.1). Thus, 

\[ T(\hat{\beta}_T - \beta) = \left(T^{-2} \sum_{t=1}^{T} x_t^2\right)^{-1} \left(T^{-1} \sum_{t=1}^{T} x_t u_t\right) \]

has a well-defined limiting distribution, so that $\hat{\beta}_T$ converges to $\beta$ at the 
rate $T$. Contrast this with the standard regression case, when $x_t \sim$ I(0); now 
$\sum x_t^2$ only needs scaling by $T^{-1}$ and we have 

\[ T^{1/2}(\hat{\beta}_T - \beta) = \left(T^{-1} \sum_{t=1}^{T} x_t^2\right)^{-1} \left(T^{-1/2} \sum_{t=1}^{T} x_t u_t\right) \]

i.e. $\hat{\beta}_T$ converges to $\beta$ at the rate $T^{1/2}$. The faster rate of convergence under 
cointegration is known as the super-consistency property (Stock, 1987) and 
implies that, even though $E(x_t u_t)$ may be non-zero through $\sigma_{uw}$ being 
non-zero, there is no asymptotic endogeneity bias. 
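
The two convergence rates can be compared numerically. The sketch below is illustrative only and rests on assumptions not taken from the text: NumPy, a true $\beta$ of zero, an error that is independent of the regressor, 2,000 replications, and a hypothetical function `slope_sd`. Quadrupling the sample size should shrink the dispersion of the slope estimate by a factor near 4 when the regressor is a random walk, but only by a factor near 2 when it is white noise.

```python
import numpy as np

def slope_sd(T, integrated, reps=2000):
    """Std. dev. of the OLS slope in y_t = beta * x_t + u_t with beta = 0."""
    rng = np.random.default_rng(T)             # seed tied to T for reproducibility
    est = np.empty(reps)
    for i in range(reps):
        w = rng.standard_normal(T)
        u = rng.standard_normal(T)             # error independent of the regressor
        x = np.cumsum(w) if integrated else w  # I(1) or I(0) regressor
        est[i] = (x @ u) / (x @ x)             # regression through the origin
    return est.std()

ratio_i1 = slope_sd(100, True) / slope_sd(400, True)     # near 4 under rate T
ratio_i0 = slope_sd(100, False) / slope_sd(400, False)   # near 2 under rate T^(1/2)
print(round(ratio_i1, 2), round(ratio_i0, 2))
```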

Although $\hat{\beta}_T$ is super-consistent, however, it is not necessarily 
asymptotically unbiased or normally distributed. To obtain the limiting 
distribution of $\hat{\beta}_T$ and its t-ratio, we condition $u_t$ on $w_t$ through 

\[ u_t = \gamma w_t + v_t, \qquad \gamma = \sigma_{uw}/\sigma_w^2, \qquad \sigma_v^2 = \sigma_u^2 - \sigma_{uw}^2/\sigma_w^2 \tag{9.6} \]

so that a non-zero contemporaneous correlation between the innovations $u_t$ 
and $w_t$, and hence endogeneity between $y_t$ and $x_t$, may be incorporated. The 
limiting distribution of $\hat{\beta}_T - \beta$ can then be written 

\[ T(\hat{\beta}_T - \beta) \Rightarrow \frac{\gamma}{2}\left(W(1)^2 + 1\right)\left(\int_0^1 W(r)^2\,dr\right)^{-1} + \left((\sigma_u/\sigma_w)^2 - \gamma^2\right)^{1/2}\left(\int_0^1 W(r)^2\,dr\right)^{-1/2} N(0,1) \]

while that of the t-ratio is 

\[ t_{\hat{\beta}_T} \Rightarrow \frac{\rho_{uw}}{2}\left(W(1)^2 + 1\right)\left(\int_0^1 W(r)^2\,dr\right)^{-1/2} + \left(1 - \rho_{uw}^2\right)^{1/2} N(0,1) \]

where $\rho_{uw} = \sigma_{uw}/\sigma_u\sigma_w$. In general, therefore, these limiting distributions will 
not be standard normal unless $\gamma = \rho_{uw} = 0$, which is the 
condition for strong exogeneity of $x_t$. When this condition does not hold, the 
first terms in the limiting distributions give rise to ‘second-order’ 
endogeneity bias (Phillips and Hansen, 1990), which, although asymptotically 
negligible in estimating $\beta$ because of super-consistency, can be important in 
finite samples. 

These theoretical results can also be demonstrated via Monte Carlo 
simulation. The model given by (9.5) was used with $\beta = \pi = 0$ and with the 
settings $\sigma_w^2 = \sigma_v^2 = 1$ and $\sigma_{uw} = 0.75$, so that $\gamma = 0.75$ and $\rho_{uw} = 0.57$. With, 
once again, $T = 1000$ and 10,000 iterations, figure 9.5 shows the simulated 
frequency distribution of $\hat{\beta}_{1000}$. The sample mean is 0.0028, and 95 per cent 
of the estimates lie in the interval $(-0.0016, 0.0093)$, reflecting the 
super-consistency property. This interval also shows the skewness of the 
distribution, however, i.e. the presence of second-order endogeneity bias caused 
by the lack of strong exogeneity of $x_t$. Figure 9.6 shows the simulated t-ratio. 
Since $\gamma$ is non-zero, the distribution will not be standard normal: although 
normal in shape, it is centred on 0.994 with a standard deviation of 0.884. 

Figures 9.7 to 9.9 show the results of three related simulations. Figure 9.7 
shows the simulated frequency distribution of the slope coefficient of the 
regression of $y_t$ on $x_t$ when $x_t$ is generated by the stationary AR(1) process 
$x_t = 0.5x_{t-1} + w_t$ rather than the random walk of (9.5b), but when all other 
settings remain the same. The endogeneity bias is now readily apparent, with 
the distribution, although normal, having a mean of 0.565 and a standard 
deviation of 0.035. Figure 9.8 shows the simulated frequency distribution of 
the slope coefficient in the same stationary regression but where now 
$\sigma_{uw} = 0$, so that there is no endogeneity; consequently, the distribution is 
centred on zero. Finally, figure 9.9 shows the frequency distribution of $\hat{\beta}_{1000}$ 
from the cointegrated model but with $\sigma_{uw} = 0$. With no endogeneity, the 
distribution is normal, as compared to figure 9.5, but has a standard error of 
0.0035, thus reflecting the super-consistency property of cointegrated 
regressions when compared to its stationary counterpart in figure 9.8. 

Figure 9.5 Simulated frequency distribution of $\hat{\beta}_{1000}$ from the cointegrated model with endogenous regressor 

The assumption made in all these simulations, that $x_t$ is without drift 
($\pi = 0$), is not innocuous, however, for when $x_t$ contains a drift 

\[ x_t = \pi t + \sum_{j=1}^{t} w_j = \pi t + Q_t \]

and we need to consider 

\[ T^{-3/2} \sum_{t=1}^{T} x_t u_t = \pi T^{-3/2} \sum_{t=1}^{T} t u_t + T^{-3/2} \sum_{t=1}^{T} u_t Q_t \]

West (1988) shows that the probability limits of $T^{-3/2} \sum_{t=1}^{T} u_t Q_t$ and 
$T^{-3} \sum_{t=1}^{T} x_t^2$ are zero and $\pi^2/3$, respectively, and that $\pi T^{-3/2} \sum_{t=1}^{T} t u_t$ is 
asymptotically normally distributed with mean zero and variance $\pi^2\sigma_u^2/3$. Hence, 

\[ T^{3/2}(\hat{\beta}_T - \beta) \Rightarrow N(0, 3\sigma_u^2/\pi^2) \]

so that in these circumstances asymptotic normality does hold, irrespective 
of whether there is endogeneity or not. Thus, consider the model (9.5) with 
$\pi = 1$ and, again, $\sigma_w^2 = \sigma_v^2 = 1$ and $\sigma_{uw} = 0.75$. Since $T = 1000$ and $\beta = 0$, 
$\hat{\beta}_{1000}$ should be normally distributed with mean zero and standard deviation 
0.000072. Figure 9.10 shows the simulated frequency distribution of $\hat{\beta}_{1000}$, 
which is indeed approximately normally distributed with a sample mean of 
zero and a standard deviation of 0.00069, and should be compared with the 
skewed distribution for $\hat{\beta}_{1000}$ that results from the absence of drift, shown in 
figure 9.5. 

Figure 9.9 Simulated frequency distribution of the t-ratio on $\hat{\beta}_{1000}$ from the cointegrated model with exogenous 
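
The limiting variance in the drift case can be compared with a simulation. The sketch below is an illustration under assumptions of my own: NumPy, a drift of one, an error built as $u_t = 0.75w_t + v_t$ with unit-variance $w_t$ and $v_t$ (so that the error variance is $1 + 0.75^2$), $T = 400$, 2,000 replications, and a hypothetical function `drift_slope_sd`. It is not intended to reproduce the figures quoted above.

```python
import numpy as np

def drift_slope_sd(T=400, reps=2000, pi=1.0, gamma=0.75, seed=6):
    """Simulated and asymptotic std. dev. of the OLS slope when x_t has drift."""
    rng = np.random.default_rng(seed)
    t = np.arange(1, T + 1)
    est = np.empty(reps)
    for i in range(reps):
        w = rng.standard_normal(T)
        v = rng.standard_normal(T)
        u = gamma * w + v                 # endogenous error, variance 1 + gamma^2
        x = pi * t + np.cumsum(w)         # random walk with drift pi
        est[i] = (x @ u) / (x @ x)        # OLS slope, true beta = 0
    sigma_u2 = 1.0 + gamma ** 2
    theory = np.sqrt(3.0 * sigma_u2 / pi ** 2) / T ** 1.5
    return est.std(), theory

sim_sd, theory_sd = drift_slope_sd()
print(f"simulated sd {sim_sd:.2e}, asymptotic sd {theory_sd:.2e}")
```

The simulated value should sit slightly above the asymptotic one in finite samples, because the stochastic part of the regressor still contributes to the denominator.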

In general, we may consider regressions of the form (9.5) but with a vector 
of I(1), rather than random walk, regressors $\mathbf{x}_t$ that may contain drifts, and 
with $u_t \sim$ I(0) rather than white noise, so that the sequence $e_t' = (u_t, \mathbf{w}_t')$ of 
joint innovations may be assumed to satisfy the conditions (9.2). When the 
regressor vector $\mathbf{x}_t$ is without drift, Phillips and Durlauf (1986) show that 
super-consistency again holds for the OLS estimator $\hat{\beta}_T$ of the coefficient 
vector $\beta$. They then go on to consider testing general linear hypotheses of the 
type considered in chapter 8, section 1.2. The limiting distribution of the 
Wald statistic (8.5), which is chi-square for non-integrated regressors, now 
contains nuisance parameters even if $u_t$ is white noise, and is non-normal 
and asymmetric. 

When some of the regressors have drifts, Park and Phillips (1988) show 
that super-consistency of $\hat{\beta}_T$ again results. Unlike when there is just a single 
regressor, however, the limiting distribution of $T(\hat{\beta}_T - \beta)$ is both 
non-normal and singular, since the regressors will be perfectly correlated 
asymptotically. This is because an I(1) variable with drift can always be 
expressed as the sum of a time trend and an I(1) variable without drift, e.g. 

\[ \Delta x_t = \pi + w_t \;\Rightarrow\; x_t = x_0 + \pi t + \tilde{x}_t, \qquad \Delta\tilde{x}_t = w_t \]

so that the correlation between two such variables will be dominated by their 
trends rather than by the driftless I(1) components. This suggests that these 
variables should be detrended and a time trend added to (9.5a). The 
estimator of the coefficient of the trend will be asymptotically normal, while 
the estimators of the coefficients on their driftless components will have the 
non-standard distribution discussed above. That normality occurs when 
there is just one regressor may be explained by noting that the non-zero drift 
$\pi$ imparts a trend into the regression. It is the trend coefficient, $\beta\pi$, that is 
asymptotically normal, and this allows the result on $\hat{\beta}_T$ to follow. 

When there are two or more integrated regressors with drift, the trend 
coefficient becomes a linear combination of the different drifts, and only this 
combination can be identified and is asymptotically normal. The vector $\beta$ 
can be estimated only by the coefficients on driftless I(1) regressors, and this 
will have the non-standard limiting distribution. If all the regressors are 
strongly exogenous, $\hat{\beta}_T$ will once again be asymptotically normal on 
appropriate standardisation. 

If a time trend is included as an additional regressor in (9.5a) then Park 
and Phillips (1988) show that the asymptotic results for the least squares 
estimators remain valid, although the estimator of the coefficient on the time 
trend depends on $\pi$. Furthermore, if additional stationary regressors are 
included in (9.5a) then their coefficients will be asymptotically normal. 


9.3 Testing for cointegration in regression 


Given the crucial role that cointegration plays in regression models with 
integrated variables, it is important to test for its presence. A number of tests 
have been proposed that are based on the residuals from the cointegrating 
regression 


\[ \hat{u}_t = y_t - \hat{\alpha}_T - \hat{\beta}_T' \mathbf{x}_t \tag{9.7} \]


Such residual-based procedures seek to test a null hypothesis of no 
cointegration by using unit root tests applied to $\hat{u}_t$. Perhaps the simplest test to 
use is the usual Durbin-Watson dw statistic, but, since the null is that $\hat{u}_t$ is 
I(1), the value of the test statistic under the null is $dw = 0$, with rejection in 
favour of the I(0) alternative occurring for values of dw greater than zero 
(Sargan and Bhargava, 1983; Bhargava, 1986). 

As is well known, the conventional critical values of the dw statistic depend 
upon the underlying processes generating the observed data, and Engle and 
Granger (1987) and Phillips and Ouliaris (1988) provide critical values, for 
various sample sizes and generating processes, in the ‘non-standard’ case 
considered here. Unfortunately, there are several difficulties associated with 
this simple test: under the no cointegration null the asymptotic distribution 
of dw depends on nuisance parameters such as the correlations among $\Delta\mathbf{x}_t$; 
the critical value bounds diverge as the number of regressors increases, 
becoming so wide as to have no practical value for inference; and the statistic 
assumes that under the null $u_t$ is a random walk, and under the alternative $u_t$ 
is a stationary AR(1) process. If this actually is the case, then Bhargava (1986) 
shows that dw has excellent power properties, but the critical bounds will not 
be correct if there is higher-order residual autocorrelation. 

Engle and Granger (1987) therefore prefer to use the t-ratio on $\hat{u}_{t-1}$ from 
the regression of $\Delta\hat{u}_t$ on $\hat{u}_{t-1}$ and lagged values of $\Delta\hat{u}_t$, in a manner 
analogous to the unit root testing approach for an observed series discussed 
in chapter 3 (see, for example, equation (3.8)). The problem here is that, 
since $\hat{u}_t$ is derived as a residual from a regression in which the cointegrating 
vector is estimated, and since if the null of non-cointegration was true such a 
vector would not be identified, using the $\tau_\mu$ critical values would reject the 
null too often, because least squares will seek the cointegrating vector that 
minimises the residual variance and hence is most likely to result in a 
stationary residual series. Moreover, an additional factor that influences the 
distribution of the t-ratio is the number of regressors contained in $\mathbf{x}_t$. Critical 
values are again available from many sources (see, for example, Hamilton, 
1994, table B.9, and Banerjee et al., 1993, table 9.1). For example, the large $T$ 
5 per cent, 2.5 per cent and 1 per cent critical values when $\mathbf{x}_t = x_t$ are $-3.37$, 
$-3.64$ and $-3.97$, respectively. 

As with conventional unit roots tests, more extensive critical values than 
those given in standard tables are available in most econometric packages, 
again obtained using the response surfaces computed by MacKinnon (1991). 
For example, when $\mathbf{x}_t = x_t$, so that there are $n = 2$ variables in $(y_t, x_t)$, the 
1 per cent critical values, denoted $C_{.01}(T)$, are calculated using 

\[ C_{.01}(T) = -3.900 - 10.534T^{-1} - 30.03T^{-2} \]

MacKinnon (1996) can be consulted for details of how to obtain p-values for 
a wide range of sample sizes. As with the conventional unit root tests, 
different sets of critical values are to be used either if there is no constant in the 
cointegrating regression or if there is a constant and trend (corresponding to 
the $\tau$ and $\tau_\tau$ variants). Non-parametric variants may also be constructed (see 
Phillips and Ouliaris, 1990). 
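
The residual-based procedure and the response-surface critical value can be put together in a short sketch. This is a minimal illustration assuming NumPy, a single regressor and a constant in the cointegrating regression; the function names `mackinnon_cv_1pct` and `engle_granger_t`, the simulated data and the lag choice are hypothetical, and the coefficients in the critical-value function are simply those quoted in the text.

```python
import numpy as np

def mackinnon_cv_1pct(T):
    """1 per cent critical value for n = 2 variables with a constant,
    from the response surface quoted in the text."""
    return -3.900 - 10.534 / T - 30.03 / T ** 2

def engle_granger_t(y, x, lags=0):
    """t-ratio on u_{t-1} in the regression of du_t on u_{t-1} and `lags`
    lagged differences, where u_t are OLS residuals of y on (1, x)."""
    T = len(y)
    X = np.column_stack([np.ones(T), x])
    u = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]   # first-step residuals
    du = np.diff(u)
    dep = du[lags:]
    cols = [u[lags:-1]] + [du[lags - k: len(du) - k] for k in range(1, lags + 1)]
    Z = np.column_stack(cols)
    b = np.linalg.lstsq(Z, dep, rcond=None)[0]
    e = dep - Z @ b
    s2 = e @ e / (len(dep) - Z.shape[1])
    return b[0] / np.sqrt(s2 * np.linalg.inv(Z.T @ Z)[0, 0])

rng = np.random.default_rng(4)
T = 500
x = np.cumsum(rng.standard_normal(T))
y_coint = 1.0 + 2.0 * x + rng.standard_normal(T)   # cointegrated with x
y_indep = np.cumsum(rng.standard_normal(T))        # independent random walk

cv = mackinnon_cv_1pct(T)
t_coint = engle_granger_t(y_coint, x)
t_indep = engle_granger_t(y_indep, x, lags=1)
print(f"critical value {cv:.3f}, cointegrated {t_coint:.2f}, independent {t_indep:.2f}")
```

The null of no cointegration is rejected when the t-ratio lies below the critical value, which should happen decisively for the cointegrated pair.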

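Tests may also be derived using the error correction model (ECM) 
representation of a cointegrated system. Consider again the model given by 
(9.5) and (9.6) with $\pi = 0$ and where $u_t$ is now generated by a stationary 
AR(1) process: 

\[ \begin{aligned} y_t - \beta x_t &= u_t \\ u_t &= \rho u_{t-1} + \varepsilon_{1t}, \qquad |\rho| < 1 \\ \Delta x_t &= w_t \\ u_t &= \gamma w_t + v_t \end{aligned} \tag{9.8} \]

This can be written as 

\[ \begin{aligned} \Delta y_t - \beta\Delta x_t &= (\rho - 1)y_{t-1} - \beta(\rho - 1)x_{t-1} + \varepsilon_{1t} \\ \Delta y_t - \gamma\Delta x_t &= \varepsilon_{2t} \end{aligned} \]

where $\varepsilon_{2t} = v_t + \beta w_t - u_{t-1}$. Thus, 

\[ \begin{bmatrix} 1 & -\beta \\ 1 & -\gamma \end{bmatrix} \begin{bmatrix} \Delta y_t \\ \Delta x_t \end{bmatrix} = \begin{bmatrix} (\rho - 1)y_{t-1} - \beta(\rho - 1)x_{t-1} \\ 0 \end{bmatrix} + \begin{bmatrix} \varepsilon_{1t} \\ \varepsilon_{2t} \end{bmatrix} \]

or 

\[ \begin{bmatrix} \Delta y_t \\ \Delta x_t \end{bmatrix} = (\beta - \gamma)^{-1} \begin{bmatrix} \gamma(1 - \rho)y_{t-1} - \gamma\beta(1 - \rho)x_{t-1} \\ (1 - \rho)y_{t-1} - \beta(1 - \rho)x_{t-1} \end{bmatrix} + \begin{bmatrix} \zeta_{1t} \\ \zeta_{2t} \end{bmatrix} \]

where 

\[ \begin{bmatrix} \zeta_{1t} \\ \zeta_{2t} \end{bmatrix} = (\beta - \gamma)^{-1} \begin{bmatrix} \beta\varepsilon_{2t} - \gamma\varepsilon_{1t} \\ \varepsilon_{2t} - \varepsilon_{1t} \end{bmatrix} \]

This leads to the ECM representation 

\[ \begin{aligned} \Delta y_t &= \gamma\delta(y_{t-1} - \beta x_{t-1}) + \zeta_{1t} = \gamma\delta u_{t-1} + \zeta_{1t} = \delta_1 u_{t-1} + \zeta_{1t} \\ \Delta x_t &= \delta(y_{t-1} - \beta x_{t-1}) + \zeta_{2t} = \delta u_{t-1} + \zeta_{2t} = \delta_2 u_{t-1} + \zeta_{2t} \end{aligned} \tag{9.9} \]

where we let $\delta = (\beta - \gamma)^{-1}(1 - \rho)$. From the ECM representation (9.9), $\delta$ is 
non-zero if and only if $\rho$ is not equal to one, but $\rho = 1$ is the condition that 
ensures that both $u_t$ and $x_t$ are random walks, in which case $y_t$ and $x_t$ cannot 
be cointegrated, i.e. if $\rho = 1$ there does not exist a $\beta$ that makes the linear 
combination of $y_t$ and $x_t$ stationary. 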

The tests discussed above investigate the null hypothesis $\rho = 1$ using the 
residuals from the cointegrating regression (9.7), but an alternative is to test 
either of the nulls $\delta_1 = 0$ and $\delta_2 = 0$, or the joint null $\delta_1 = \delta_2 = 0$, which 
would be more efficient given that the cross-equation restriction in (9.7) 
implies that the error correction $u_{t-1}$ enters both equations of the ECM. 
There is a problem, however: since $\beta$ is unknown, it must be estimated from 
the data. If $\rho = 1$ is valid, however, $\beta$ is unidentified and the ECM (9.9) is 
invalid. Only if $y_t$ and $x_t$ are cointegrated can $\beta$ be estimated from the 
cointegrating regression, but a test must be based upon the distribution of the 
statistic assuming that the null is true. 

A solution to this problem may be found by rewriting the error correction 
equation for $y_t$, say, as

$$
\Delta y_t = \delta_1 (y_{t-1} - x_{t-1}) + d x_{t-1} + e_{1t}
$$

where $d = \delta_1(1 - \beta)$, so that a test of $\delta_1 = 0$ can be based on its associated
t-ratio and tests of $\delta_2 = 0$ and $\delta_1 = \delta_2 = 0$ can be constructed analogously.
This statistic will not be asymptotically normal, however, and Banerjee et al. 
(1993, table 9.6) provide fractiles for the simulated distribution for various 
sample sizes; they are slightly closer to zero than those of the corresponding 
residual unit root test. 
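
The mechanics of the t-ratio test can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `simulate` and `ecm_t_ratio`, the parameter values ($\beta = 2$, $\rho = 0.5$) and the simplified data-generating process (in which $x_t$ is a pure random walk independent of $u_t$) are ours, not those of Banerjee et al. (1993). The resulting statistic would still have to be compared with their simulated fractiles, not with normal critical values.

```python
import numpy as np

def simulate(rho, beta=2.0, T=1000, seed=0):
    """Hypothetical DGP: x_t is a random walk, y_t - beta*x_t = u_t, u_t is AR(1)."""
    rng = np.random.default_rng(seed)
    x = np.cumsum(rng.standard_normal(T))
    eps = rng.standard_normal(T)
    u = np.zeros(T)
    for t in range(1, T):
        u[t] = rho * u[t - 1] + eps[t]
    return beta * x + u, x

def ecm_t_ratio(y, x):
    """t-ratio on delta_1 in dy_t = delta_1*(y_{t-1} - x_{t-1}) + d*x_{t-1} + e_t."""
    dy = np.diff(y)
    X = np.column_stack([(y - x)[:-1], x[:-1]])   # lagged (y - x) and lagged x
    coef, *_ = np.linalg.lstsq(X, dy, rcond=None)
    resid = dy - X @ coef
    s2 = resid @ resid / (len(dy) - X.shape[1])   # residual variance
    cov = s2 * np.linalg.inv(X.T @ X)             # conventional OLS covariance
    return coef[0] / np.sqrt(cov[0, 0])

t_coint = ecm_t_ratio(*simulate(rho=0.5))   # cointegrated case
t_rw = ecm_t_ratio(*simulate(rho=1.0))      # u_t is a random walk: no cointegration
print(t_coint, t_rw)
```

Because $\beta$ does not have to be known to run this regression, the statistic is available under both the null and the alternative, which is the point of the reparameterisation.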

How powerful are these tests of cointegration? Banerjee et al. (1993)
investigate power by conducting a Monte Carlo experiment using two
alternative data-generating processes for $y_t$ and $x_t$. The first is the model
(9.8), where we rewrite the equation for $y_t$ as

$$
y_t = \rho y_{t-1} + \beta x_t - \beta\rho x_{t-1} + \varepsilon_{1t}
$$

The second is the general dynamic model

$$
y_t = \alpha_1 y_{t-1} + \alpha_2 x_t + \alpha_3 x_{t-1} + u_t
$$

from which the first model is obtained by imposing the 'common factor'
restriction $\alpha_1\alpha_2 + \alpha_3 = 0$. Banerjee et al. find that the t-ratio test performs
better than the unit root test in the absence of a common factor. For further
analysis of these tests, see Ericsson and MacKinnon (2002).


Example 9.1 Cointegration and the market model: an example 
of testing for cointegration 
The market model is typically defined as 


$$
r_{p,t} = \alpha_p + \beta_p r_{m,t} + u_t
$$

using the notation introduced in example 8.4 (in contrast to the CAPM
analysed there, the actual returns on a stock or small portfolio in period t, $r_{p,t}$,
and on the corresponding market return, $r_{m,t}$, are used rather than excess
returns). If we assume that either dividends are reinvested or that they are
ignored completely, then the returns will typically be calculated as

$$
r_{p,t} = \Delta y_t, \qquad r_{m,t} = \Delta x_t
$$

where $y_t$ and $x_t$ are the logarithms of the stock price and market index
respectively. If $y_t$ and $x_t$ are I(1) then such a specification would be appropriate
if the two series were not cointegrated; if they were cointegrated then
the market model would be misspecified, in that an error correction term,
$y_{t-1} - \beta x_{t-1}$, would be required as an additional regressor.

Table 9.1 Market model cointegration test statistics

          dw       C        t
CTLD     0.05    −1.38    −2.40
PRU      0.24*   −2.92    −3.03
LGEN     0.14*   −3.47*   −3.71*

Notes: dw is the Durbin-Watson statistic from the cointegrating regression. C is
the unit root test on the cointegrating residuals. t is the t-ratio from the error
correction model. * denotes significance at the 5 per cent level: critical values are,
approximately, 0.14 for dw and −3.42 for C and t.

We illustrate the possibility of cointegration within the market model by 
using several examples taken from a data set that has been extensively analysed 
by Mills (1996b), Coutts, Roberts and Mills (1997) and Markellos and Mills 
(2003). This data set contains weekly observations on the London Stock 
Exchange FTSE 100 index and on the prices of the fifty-six companies that 
remained constituents of the index throughout the first ten years of its exist- 
ence, January 1984 to December 1993, so that T=521. The relationships 
between the (logarithmic) prices of three of these companies, Courtaulds 
(CTLD), Prudential (PRU) and Legal and General (LGEN), and the FTSE 100 
are analysed in a sequence of examples. (Unit root tests confirm that all are I(1),
although the FTSE 100 is only marginally so; the possibility that it is I(0), or more
precisely trend stationary, will be taken into account in a later example.)

Table 9.1 presents the three cointegration test statistics discussed above for 
each of the three series. There is no evidence of cointegration between CTLD 
and the FTSE 100 and little evidence of cointegration for PRU, but it appears 
that LGEN is cointegrated with the FTSE 100. Figure 9.11 plots each of the 
three series against the FTSE 100 and the lack of cointegration between 
CTLD and the market index is readily apparent. There is much more
evidence of common trends in the other two plots, but it would be difficult to
ascertain whether cointegration does in fact exist from the plots alone, thus
emphasising the need for formal testing procedures.

Figure 9.11 Stock prices and the FTSE 100
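
For readers who wish to reproduce statistics of the kind reported in table 9.1, the two residual-based quantities can be computed as in the following sketch. It uses simulated data, not the FTSE 100 data set, and the function names and the simulated parameters are hypothetical. The unit root statistic here is the simplest unaugmented Dickey-Fuller t-ratio; whether the C statistic in table 9.1 includes augmentation lags is not stated in the text, so this should be read as an assumption.

```python
import numpy as np

def cointegrating_residuals(y, x):
    """Residuals from the static OLS regression of y on a constant and x."""
    X = np.column_stack([np.ones(len(x)), x])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ b

def durbin_watson(u):
    """Durbin-Watson statistic: sum of squared differences over sum of squares."""
    return np.sum(np.diff(u) ** 2) / np.sum(u ** 2)

def df_t_stat(u):
    """t-ratio on phi in du_t = phi*u_{t-1} + error (no augmentation lags)."""
    du, lag = np.diff(u), u[:-1]
    phi = (lag @ du) / (lag @ lag)
    resid = du - phi * lag
    s2 = resid @ resid / (len(du) - 1)
    return phi / np.sqrt(s2 / (lag @ lag))

# Simulated cointegrated pair with T = 521, matching only the sample size of the example.
rng = np.random.default_rng(0)
T = 521
x_sim = np.cumsum(rng.standard_normal(T))
u_sim = np.zeros(T)
eps = rng.standard_normal(T)
for t in range(1, T):
    u_sim[t] = 0.8 * u_sim[t - 1] + eps[t]
y_sim = 0.5 + x_sim + u_sim

res = cointegrating_residuals(y_sim, x_sim)
dw_stat = durbin_watson(res)
c_stat = df_t_stat(res)
print(dw_stat, c_stat)
```

The two statistics would then be compared with the approximate 5 per cent critical values quoted in the notes to table 9.1 (0.14 for dw and −3.42 for C).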


9.4 Estimating cointegrating regressions 


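As we have seen, OLS estimation of the cointegrating regression produces
estimates that, although super-consistent, are nevertheless biased even in
large samples (recall figure 9.5, which showed a biased sampling distribution
for $\hat\beta_T$ when there was endogeneity between $y_t$ and $x_t$; autocorrelation in $u_t$
will exacerbate the situation further).

A general set-up that allows for both contemporaneous correlation and
autocorrelation is the 'triangular' system

$$
y_t = \beta x_t + u_t \tag{9.10}
$$

$$
\Delta x_t = w_t
$$

We assume that $\mathbf{u}_t^{\top} = (u_t, w_t^{\top})$ satisfies the conditions (9.2). With
$S_T = \sum_{t=1}^{T} \mathbf{u}_t$ then, for $r \in [0, 1]$,
$U_T(r) = T^{-1/2} S_{[Tr]} \Rightarrow U(r) = (U_1(r)^{\top}, U_2(r)^{\top})^{\top}$,
where $U(r)$ is $(1 + k)$ vector Brownian motion, partitioned conformably
with $\mathbf{u}_t$ and having long-run covariance matrix $\Sigma_S$, defined as

$$
\Sigma_S = \lim_{T \to \infty} T^{-1} E\left(S_T S_T^{\top}\right)
= \lim_{T \to \infty} T^{-1} \sum_{t=1}^{T} \sum_{s=1}^{T} E\left(\mathbf{u}_t \mathbf{u}_s^{\top}\right)
$$

Since this is the sum of all the covariances of $\mathbf{u}_t$ and $\mathbf{u}_s$, it can be decomposed
into a contemporaneous variance and sums of autocovariances,

$$
\Sigma_S = \Lambda_0 + \Lambda + \Lambda^{\top}
$$

where $\Lambda_0 = E(\mathbf{u}_0 \mathbf{u}_0^{\top})$ and $\Lambda = \sum_{t=1}^{\infty} E(\mathbf{u}_0 \mathbf{u}_t^{\top})$. For convenience, we partition
$\Sigma_S$ as

$$
\Sigma_S = \begin{bmatrix} \Sigma_{11} & \Sigma_{21}^{\top} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix}
$$

with $\Lambda_0$ and $\Lambda$ partitioned similarly (note that $\Sigma_{11}$ is a scalar).
Park and Phillips (1988) show that the limiting distribution of $\hat\beta_T$ is

$$
T(\hat\beta_T - \beta) \Rightarrow
\left( \int_0^1 U_2(r) U_2(r)^{\top} dr \right)^{-1}
\left( \int_0^1 U_2(r)\, dU_1(r) + \Delta_{21} \right)
$$

where $\Delta_{21} = \Sigma_{21} + \Lambda_{21}$. It is this term that introduces the second-order bias,
and, of course, it arises because of the contemporaneous and serial
dependence of the regressors. Phillips and Hansen (1990) have proposed a
modification to OLS that eliminates this bias. Define

$$
y_t^{+} = y_t - \hat\Sigma_{12} \hat\Sigma_{22}^{-1} \Delta x_t
$$

which uses any consistent estimator $\hat\Sigma_S$ of $\Sigma_S$. The fully modified OLS (FM-OLS)
estimator is defined as

$$
\hat\beta_T^{+} = \left( \sum_{t=1}^{T} y_t^{+} x_t^{\top} - T \hat\delta^{+} \right)
\left( \sum_{t=1}^{T} x_t x_t^{\top} \right)^{-1}
$$

where

$$
\hat\delta^{+} = \left( 1 \;\; -\hat\Sigma_{12} \hat\Sigma_{22}^{-1} \right) \hat\Delta_2
$$

$\hat\Delta_2$ being a consistent estimator of $\Delta_2 = \sum_{t=0}^{\infty} E(w_0 \mathbf{u}_t^{\top})$.
The limiting distribution of the FM-OLS estimator is

$$
T(\hat\beta_T^{+} - \beta) \Rightarrow
\left( \int_0^1 U_2(r) U_2(r)^{\top} dr \right)^{-1}
\left( \int_0^1 U_2(r)\, dU_{1.2}(r) \right)
$$

where $U_{1.2}(r)$ is independent of $U_2(r)$. The use of $y_t^{+}$ corrects for long-run
simultaneity, whilst incorporating $\hat\delta^{+}$ accounts for any residual autocorrelation.
This allows conventional chi-square asymptotics to be used for
inference. For example, the null $R\beta = r$ may be tested by constructing the
modified Wald statistic (cf. equation (9.5))

$$
W_{FM}^{+} = T (R\hat\beta_T^{+} - r)^{\top}
\left[ R (X^{\top}X/T)^{-1} \hat{V}^{+} (X^{\top}X/T)^{-1} R^{\top} \right]^{-1}
(R\hat\beta_T^{+} - r)
$$

where $\hat{V}^{+} = \hat\Sigma_{11} - \hat\Sigma_{12} \hat\Sigma_{22}^{-1} \hat\Sigma_{21}$ and $X$ is as defined in (9.2). For $\hat\Sigma_S$ we may
use the Newey-West estimator (9.4).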

Several other estimators have been proposed that correct for both correlation
between $u_t$ and $w_t$ and autocorrelation in $u_t$. The approaches of
Saikkonen (1991), Phillips and Loretan (1991), Stock and Watson (1993) and
Banerjee, Dolado and Mestre (1998) all suggest augmenting (9.10) with leads
and lags of $\Delta x_t$ when there is correlation between $u_t$ and $w_t$, i.e. estimating

$$
y_t = \beta x_t + \sum_{s=-p}^{p} \gamma_s \Delta x_{t-s} + u_t \tag{9.11}
$$

where p is chosen such that the correlation between $u_t$ and $w_{t-s}$ is zero for $|s| > p$.
If $x_t$ is strongly exogenous, so that $u_t$ does not Granger-cause $w_t$, then the
leads of $\Delta x_t$ will not be required ($\gamma_s = 0$, $s < 0$). Autocorrelation in $u_t$ may be
captured by assuming that $u_t$ follows an AR(p) process and estimating (9.11)
by generalised least squares, by including lags of $\Delta y_t$ as additional regressors,

$$
y_t = \beta x_t + \sum_{s=-p}^{p} \gamma_s \Delta x_{t-s} + \sum_{s=1}^{p} \delta_s \Delta y_{t-s} + u_t \tag{9.12}
$$

or by including lagged values of the equilibrium error $y_t - \beta x_t$,

$$
y_t = \beta x_t + \sum_{s=-p}^{p} \gamma_s \Delta x_{t-s} + \sum_{s=1}^{p} \delta_s (y_{t-s} - \beta x_{t-s}) + u_t \tag{9.13}
$$

in which case NLS estimation will be required.
Note that an equivalent form of (9.13) is the ECM

$$
\Delta y_t = \sum_{s=-p}^{p} \gamma_s^{*} \Delta x_{t-s} + \sum_{s=1}^{p} \delta_s^{*} (y_{t-s} - \beta x_{t-s}) + u_t \tag{9.14}
$$

where $\gamma_0^{*} = \gamma_0 + \beta$, $\delta_1^{*} = \delta_1 - 1$, $\gamma_s^{*} = \gamma_s$ for $s \neq 0$ and $\delta_s^{*} = \delta_s$ for $s \neq 1$. While all
these estimators can be shown to be asymptotically efficient, Phillips and
Loretan (1991) point out that the NLS estimator of (9.13), or, equivalently,
(9.14), has an important advantage over OLS estimation of (9.12). This is
because, since both $y_t$ and $x_t$ are I(1),

$$
y_{t-s} = \sum_{i=s}^{t-1} \Delta y_{t-i}, \qquad x_{t-s} = \sum_{i=s}^{t-1} \Delta x_{t-i}
$$

if we set initial conditions $y_0 = x_0 = 0$. Substituting these partial sums into
(9.13) will produce (9.12) but with the lag length p set equal to $t - 1$.
Moreover, the lag coefficients will not, in general, decay as the lag increases,
because the partial sums imply unit weights for individual innovations.
Thus, in order to model short-run dynamics using the variables $\Delta y_{t-s}$ and
$\Delta x_{t-s}$ it is necessary to include all lags because of this shock persistence,
which is quite impractical in empirical applications and cannot be justified
in theory, where lag truncation arguments are needed to develop the
asymptotics.
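
A minimal sketch of the leads-and-lags regression (9.11), estimated by OLS, is given below. The function name `dols`, the inclusion of an intercept and the simulated data-generating process (in which $u_t$ depends on current and lagged $w_t$) are illustrative assumptions, and no attempt is made to correct the standard errors for autocorrelation in $u_t$.

```python
import numpy as np

def dols(y, x, p):
    """OLS of y_t on a constant, x_t and dx_{t-s} for s = -p, ..., p.

    Returns the estimated coefficient on x_t (the cointegrating parameter).
    """
    dx = np.diff(x, prepend=np.nan)        # dx[t] = x[t] - x[t-1]; dx[0] is undefined
    T = len(y)
    t = np.arange(p + 1, T - p)            # dates for which all leads and lags exist
    cols = [np.ones(len(t)), x[t]] + [dx[t - s] for s in range(-p, p + 1)]
    X = np.column_stack(cols)
    coef, *_ = np.linalg.lstsq(X, y[t], rcond=None)
    return coef[1]

# Hypothetical DGP with beta = 1 and u_t correlated with w_t and w_{t-1}.
rng = np.random.default_rng(1)
T = 2000
w = rng.standard_normal(T)
e = rng.standard_normal(T)
u = 0.8 * w + e
u[1:] += 0.4 * w[:-1]
x = np.cumsum(w)
y = 1.0 * x + u

b_dols = dols(y, x, p=2)
print(b_dols)
```

Setting the lower limit of `range(-p, p + 1)` to zero would drop the leads, which corresponds to the strongly exogenous case discussed above.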
Example 9.2 Estimating a cointegrated market model
Example 9.1 found strong evidence in favour of cointegration between LGEN
and the FTSE 100 index, and this example considers the various estimators of
the cointegration parameters that result from taking different approaches to
dynamic modelling. Estimation of the static cointegrating regression by OLS
obtains

$$
y_t = \underset{(0.063)}{-0.036} + \underset{(0.012)}{0.988}\, x_t + \hat{u}_t
$$

Of course, the standard errors shown in parentheses cannot be used for
inference, but estimation by FM-OLS, using $n = [521^{1/3}] = 8$ lags in the
Newey-West estimator of $\Sigma_S$, produces almost identical parameter estimates
but considerably larger standard errors:

$$
y_t = \underset{(0.358)}{-0.044} + \underset{(0.067)}{0.988}\, x_t + \hat{u}_t
$$

In investigating the alternative dynamic estimators, we first ascertain
whether x is strongly exogenous. Using four lags, the hypothesis that y does
not Granger-cause x has a p-value of 0.58, and the inclusion of four leads of
$\Delta x_t$ in (9.11), after estimation by GLS, is only jointly significant at the 0.34
level. A parsimonious GLS-estimated model is

$$
y_t = \underset{(0.284)}{-0.165} + \underset{(0.053)}{1.011}\, x_t + \underset{(0.051)}{0.103}\, \Delta x_t + \hat{u}_t
$$

$$
\hat{u}_t = \underset{(0.034)}{0.819}\, \hat{u}_{t-1} + \underset{(0.034)}{0.123}\, \hat{u}_{t-3} + \hat{a}_t
$$

which has an equation standard error of 3.18 per cent. Attempts to fit a
model of the form (9.12) were unsuccessful, for the reasons discussed
above: the lag coefficients failed to die out, remaining significant at high
lags, and the error term could not be reduced to white noise, which are exactly the
problems that should be produced by the shock persistence caused by the
unit roots in the system.

Fitting models of the form (9.13)/(9.14) was successful, however, yielding 


(0.023) (0.075) (0.093) 


$$
{} + \underset{(0.034)}{0.815}\,(y_{t-1} - 0.997 x_{t-1}) + \underset{(0.034)}{0.128}\,(y_{t-3} - 0.997 x_{t-3}) + \hat{u}_t
$$

which has an equation standard error of 3.03 per cent. All the models suggest
that $\beta = 1$, and imposing this restriction leads to

$$
y_t - x_t = \underset{(0.002)}{0.006} + \underset{(0.056)}{0.107}\, \Delta x_t + \underset{(0.034)}{0.815}\,(y_{t-1} - x_{t-1}) + \underset{(0.034)}{0.128}\,(y_{t-3} - x_{t-3}) + \hat{u}_t
$$

Thus, the price relative, $y_t - x_t$, plotted in figure 9.12, is stationary, following
an autoregressive process with one large root (0.943) and a pair of complex
roots, and it is positively related to the current change in the market index.

Figure 9.12 LGEN relative to the FTSE 100


9.5 VARs with integrated variables 


9.5.1 VARs with I(1) variables


Consider again the VAR(m) process introduced in chapter 8, section 6,

$$
\Pi(B) z_t = \mu + v_t \tag{9.15}
$$

where $z_t$, assumed to be I(1), and $v_t$ are both $n \times 1$ vectors, conditions (8.20)
hold,

$$
\Pi(B) = I_n - \sum_{i=1}^{m} \Pi_i B^i \tag{9.16}
$$

and where we have introduced an $n \times 1$ vector of constants, $\mu$. Assuming
$m > 1$, the matrix polynomial $\Pi(B)$ can always be written as

$$
\Pi(B) = (I_n - \Pi B) - \Phi(B) B (1 - B)
$$

where

$$
\Pi = \sum_{i=1}^{m} \Pi_i
$$

and

$$
\Phi(B) = \sum_{i=1}^{m-1} \Phi_i B^{i-1}, \qquad \Phi_i = -\sum_{j=i+1}^{m} \Pi_j
$$

The $\Phi_i$ can be obtained recursively from $\Phi_1 = -\Pi + \Pi_1$ as $\Phi_i = \Phi_{i-1} + \Pi_i$,
$i = 2, \ldots, m - 1$. With this decomposition of $\Pi(B)$, (9.15) can always be
written as

$$
z_t = \Phi(B) \Delta z_{t-1} + \mu + \Pi z_{t-1} + v_t \tag{9.17}
$$

or

$$
\Delta z_t = \Phi(B) \Delta z_{t-1} + \mu + A z_{t-1} + v_t \tag{9.18}
$$

where

$$
A = \Pi - I_n = -\Pi(1)
$$

We consider first the case where

$$
\Pi = I_n \tag{9.19}
$$

so that $A = 0$ and $\Delta z_t$ follows the VAR(m − 1) process

$$
\Delta z_t = \Phi(B) \Delta z_{t-1} + \mu + v_t \tag{9.20}
$$

The condition $\Pi = I_n$ implies that

$$
|A| = |\Pi_1 + \cdots + \Pi_m - I_n| = 0 \tag{9.21}
$$

in which case the VAR (9.15) is said to contain at least one unit root. Note,
however, that (9.21) does not necessarily imply (9.19), and it is this fact that
leads to cointegrated or reduced-rank VARs, as we shall see later.
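
The algebra linking the levels VAR (9.15) to the forms (9.17) and (9.18) is easily checked numerically. The sketch below, in which the function name `var_to_vecm` and the randomly drawn coefficient matrices are purely illustrative, computes $\Pi$, $A$ and the $\Phi_i$ from a set of $\Pi_i$ matrices and confirms that, with $\mu = 0$ and $v_t = 0$, the two representations imply the same value of $z_t$.

```python
import numpy as np

def var_to_vecm(Pi_list):
    """Map VAR(m) matrices [Pi_1, ..., Pi_m] to (Pi, A, [Phi_1, ..., Phi_{m-1}])."""
    n = Pi_list[0].shape[0]
    m = len(Pi_list)
    Pi = sum(Pi_list)                                  # Pi = Pi_1 + ... + Pi_m
    A = Pi - np.eye(n)                                 # A = Pi - I_n = -Pi(1)
    Phi = [-sum(Pi_list[i:]) for i in range(1, m)]     # Phi_i = -(Pi_{i+1} + ... + Pi_m)
    return Pi, A, Phi

rng = np.random.default_rng(0)
n, m = 2, 3
Pi_list = [0.3 * rng.standard_normal((n, n)) for _ in range(m)]
Pi, A, Phi = var_to_vecm(Pi_list)

# Arbitrary history z_{t-1}, z_{t-2}, z_{t-3}.
z_lags = [rng.standard_normal(n) for _ in range(m)]

# Levels form: z_t = Pi_1 z_{t-1} + ... + Pi_m z_{t-m}.
z_levels = sum(P @ z for P, z in zip(Pi_list, z_lags))

# Differenced form (9.18): dz_t = sum_i Phi_i dz_{t-i} + A z_{t-1}.
dz_lags = [z_lags[i] - z_lags[i + 1] for i in range(m - 1)]
dz_t = sum(F @ d for F, d in zip(Phi, dz_lags)) + A @ z_lags[0]
z_vecm = z_lags[0] + dz_t

print(np.allclose(z_levels, z_vecm))
```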

Consider OLS estimation of the levels VAR (9.15) and the differenced
VARs (9.17) and (9.18) under the assumption that none of the variables
making up $z_t$ contain drifts, so that $\mu = 0$, although constants are included in
the estimated regressions. The fitted values from (9.15) and (9.17) will be the
same, because the $\hat\Pi_i$ are linked to the $\hat\Phi_i$ by

$$
\begin{aligned}
\hat\Pi_1 &= \hat\Pi + \hat\Phi_1 \\
\hat\Pi_i &= \hat\Phi_i - \hat\Phi_{i-1}, \qquad i = 2, 3, \ldots, m - 1 \\
\hat\Pi_m &= -\hat\Phi_{m-1}
\end{aligned}
$$

Now, from Hamilton (1994, chap. 19.2), the $\hat\Phi_i$ converge to $\Phi_i$ at rate $T^{1/2}$,
and $T^{1/2}(\hat\Phi_i - \Phi_i)$ is asymptotically normal. Since the $\hat\Pi_i$, $i \geq 2$, are linear
combinations of the $\hat\Phi_i$, $T^{1/2}(\hat\Pi_i - \Pi_i)$, $i \geq 2$, is also asymptotically normal.
$\hat\Pi$, on the other hand, converges to $\Pi$ at rate T, and, although its asymptotic
distribution is non-normal, this faster rate of convergence ensures that
$\hat\Pi_1 = \hat\Pi + \hat\Phi_1$ also converges at rate $T^{1/2}$ to an asymptotic normal, as the speed of
convergence is determined by the coefficients with the slower rate. Hence, if the
VAR is estimated in levels, then, even though it contains a unit root, conventional
t and F-tests involving a linear combination other than $\Pi = \Pi_1 + \cdots + \Pi_m$
have usual asymptotic distributions. For example, tests for determining
the order of the VAR, since they will not involve $\Pi = \Pi_1 + \cdots + \Pi_m$, will have
usual $\chi^2$ distributions. On the other hand, Granger causality tests will involve
coefficients of $\Pi$ and will typically not have limiting $\chi^2$ distributions.

If there is a drift in $z_t$ then the above results still hold, unlike the univariate
case, where a single regressor with drift makes all coefficients asymptotically
normal and all F-tests asymptotically $\chi^2$.


9.5.2 VARs with cointegrated variables 


Let us now reconsider the case when (9.21) holds, so that A is singular,
$|A| = 0$, but $A \neq 0$ and $\Pi \neq I_n$. Being singular, A will thus have reduced rank
(r, say), where $0 < r < n$. In such circumstances, A can be expressed as the
product of two $n \times r$ matrices $\beta$ and $\alpha$, both of full-column rank r, i.e.
$A = \beta\alpha^{\top}$. To see this, note that $\alpha^{\top}$ is the matrix containing the r linearly
independent rows of A, so that A must be able to be written as a linear
combination of $\alpha^{\top}$; $\beta$ must then be the matrix of coefficients that are needed
to do this. These r linearly independent rows of A, contained as the rows of
$\alpha^{\top} = (\alpha_1, \ldots, \alpha_r)^{\top}$, are known as the cointegrating vectors, and A will contain
only $n - r$ unit roots, rather than the n unit roots that it will contain if
$A = 0$, which will be the case if $r = 0$.
Why are the rows of $\alpha^{\top}$ known as cointegrating vectors? Substituting
$A = \beta\alpha^{\top}$ into equation (9.18) yields

$$
\Delta z_t = \Phi(B) \Delta z_{t-1} + \mu + \beta\alpha^{\top} z_{t-1} + v_t
$$

The assumption that $z_t$ is I(1) implies that, since $\Delta z_t$ must then be I(0), $\alpha^{\top} z_{t-1}$
must also be I(0) for both sides of the equation to 'balance'. In other words,
$\alpha^{\top}$ is a matrix whose rows, when post-multiplied by $z_t$, produce stationary
linear combinations of $z_t$, i.e. the r linear combinations $\alpha_1^{\top} z_t, \ldots, \alpha_r^{\top} z_t$ are all
stationary.

Thus, if $z_t$ is cointegrated with cointegrated rank r, then it can be represented
as the vector error correction model

$$
\Delta z_t = \Phi(B) \Delta z_{t-1} + \mu + \beta e_{t-1} + v_t \tag{9.22}
$$

where $e_t = \alpha^{\top} z_t$ are the r stationary error corrections. This is known as
Granger's representation theorem (Engle and Granger, 1987), and a detailed
proof can be found in, for example, Banerjee et al. (1993, chap. 5.3), where
various additional technical conditions are discussed.

Several additional points are worth mentioning. The parameters $\alpha$ and $\beta$
are not uniquely identified, since, for any non-singular $r \times r$ matrix $\xi$, the
products $\beta\alpha^{\top}$ and $\beta\xi(\xi^{-1}\alpha^{\top})$ will both equal A. If $r = 0$ then we have already
seen that the model becomes a VAR(m − 1) in the first differences $\Delta z_t$. If
$r = n$, on the other hand, A is of full rank, is non-singular, and $z_t$ will contain
no unit roots, i.e. $z_t$ is in fact I(0) and a VAR(m) in the levels is appropriate
from the outset; we are then in the framework of chapter 8, section 6. The
error corrections $e_t$, although stationary, are not restricted to having zero
means, so that, as (9.22) stands, growth in $z_t$ can come about via both the
error correction $e_t$ and the autonomous drift component $\mu$.
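
The lack of identification is simple to verify numerically. In the following sketch all matrices are hypothetical: two different pairs of adjustment and cointegrating matrices are constructed that deliver exactly the same reduced-rank matrix A.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 4, 2
beta = rng.standard_normal((n, r))      # hypothetical adjustment coefficients
alpha = rng.standard_normal((n, r))     # hypothetical cointegrating vectors (columns)
A = beta @ alpha.T                      # n x n matrix of rank r

xi = np.array([[2.0, 1.0],
               [0.0, 1.0]])             # any non-singular r x r matrix
beta_rot = beta @ xi
alpha_rot_T = np.linalg.inv(xi) @ alpha.T

print(np.linalg.matrix_rank(A), np.allclose(A, beta_rot @ alpha_rot_T))
```

Since the data are informative only about A, restrictions from outside the model are needed to choose between such observationally equivalent pairs.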

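Note, however, that, without loss of generality, $\mu$ can be written as
$\mu = \beta\gamma + \beta_{\perp}\gamma^{*}$, where $\beta_{\perp}$ is an $n \times (n - r)$ matrix known as the orthogonal
complement of $\beta$, defined such that $\beta_{\perp}^{\top}\beta = 0$. Since
$\beta^{\top}\mu = \beta^{\top}\beta\gamma + \beta^{\top}\beta_{\perp}\gamma^{*} = \beta^{\top}\beta\gamma$,
we have $\gamma = (\beta^{\top}\beta)^{-1}\beta^{\top}\mu$ and $\gamma^{*} = (\beta_{\perp}^{\top}\beta_{\perp})^{-1}\beta_{\perp}^{\top}\mu$, so that,
geometrically, $\mu$ has been decomposed in the directions of $\gamma$ and $\gamma^{*}$. The
VECM (9.22) can then be written as

$$
\Delta z_t = \Phi(B) \Delta z_{t-1} + \beta_{\perp}\gamma^{*} + \beta(\gamma + e_{t-1}) + v_t \tag{9.23}
$$

so that if the condition $\beta_{\perp}\gamma^{*} = 0$ holds, i.e. that $\mu = \beta\gamma$, then the constant
enters the system only via the error correction term.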
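
The decomposition of $\mu$ can be illustrated as follows. The helper `orth_complement`, which builds an orthogonal complement from the singular value decomposition, and the numerical values of $\beta$ and $\mu$ are assumptions made for the example; any basis of the null space of $\beta^{\top}$ would serve equally well.

```python
import numpy as np

def orth_complement(beta):
    """Return an n x (n - r) matrix whose columns are orthogonal to those of beta."""
    u, _, _ = np.linalg.svd(beta, full_matrices=True)
    r = np.linalg.matrix_rank(beta)
    return u[:, r:]                      # left singular vectors outside the column space

beta = np.array([[1.0, 0.0],
                 [0.5, 1.0],
                 [0.0, -1.0],
                 [2.0, 0.3]])            # hypothetical n = 4, r = 2
mu = np.array([0.1, -0.2, 0.05, 0.3])    # hypothetical drift vector

beta_perp = orth_complement(beta)
gamma = np.linalg.solve(beta.T @ beta, beta.T @ mu)
gamma_star = np.linalg.solve(beta_perp.T @ beta_perp, beta_perp.T @ mu)

# The drift is recovered as the sum of its two components.
print(np.allclose(beta @ gamma + beta_perp @ gamma_star, mu))
```

The constant is restricted to the error correction term precisely when `beta_perp @ gamma_star` is the zero vector.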
How the constant is treated is important in determining the appropriate
estimation procedure and the set of critical values used for inference. An
important extension is when a linear trend is included in the VAR:

$$
\Pi(B) z_t = \mu_0 + \mu_1 t + v_t \tag{9.24}
$$

Here we can write $\mu_i = \beta\gamma_i + \beta_{\perp}\gamma_i^{*}$, $i = 0, 1$, and the counterpart to (9.23)
becomes

$$
\Delta z_t = \Phi(B) \Delta z_{t-1} + \beta_{\perp}(\gamma_0^{*} + \gamma_1^{*} t) + \beta(\gamma_0 + \gamma_1 (t - 1) + e_{t-1}) + v_t \tag{9.25}
$$

In this case the constant and trend will be restricted to the error correction if
$\mu_i = \beta\gamma_i$, $i = 0, 1$, i.e. we define the 'trend-included' error correction as
$e_t^{*} = e_t + \gamma_0 + \gamma_1 t$.

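Further implications of the presence of a linear trend are best analysed by
introducing the infinite-order vector polynomial $C(B)$, defined such that
$C(B)\Pi(B) = (1 - B) I_n$, and which can be written, analogously to $\Pi(B)$, as

$$
\begin{aligned}
C(B) &= I_n + CB + (C_1^{*} B + C_2^{*} B^2 + \cdots)(1 - B) \\
&= I_n + C + (C_0^{*} + C_1^{*} B + C_2^{*} B^2 + \cdots)(1 - B) \\
&= I_n + C + C^{*}(B)(1 - B) \\
&= C(1) + C^{*}(B)(1 - B)
\end{aligned}
$$

The matrices of $C(B)$, $C_0, C_1, \ldots$, are given by the recursions

$$
C_i = \sum_{j=1}^{m} C_{i-j} \Pi_j, \quad i > 1, \qquad C_0 = I_n, \qquad C_1 = \Pi_1 - I_n
$$

(with $C_i = 0$ for $i < 0$), so that

$$
C = \sum_{i=1}^{\infty} C_i = C(1) - I_n, \qquad C_0^{*} = -C
$$

and

$$
C_i^{*} = C_{i-1}^{*} + C_i, \qquad i > 0
$$

Equation (9.24) can then be written as

$$
\begin{aligned}
\Delta z_t &= C(B)(\mu_0 + \mu_1 t + v_t) \\
&= (C(1) + C^{*}(B)(1 - B))(\mu_0 + \mu_1 t) + C(B) v_t \\
&= C(1)\mu_0 + C^{*}(1)\mu_1 + C(1)\mu_1 t + C(B) v_t \\
&= b_0 + b_1 t + C(B) v_t
\end{aligned}
$$

In levels, this becomes

$$
\begin{aligned}
z_t &= z_0 + b_0 t + b_1 \frac{t(t+1)}{2} + C(B) \sum_{s=1}^{t} v_s \\
&= z_0 + b_0 t + b_1 \frac{t(t+1)}{2} + (C(1) + C^{*}(B)(1 - B)) \sum_{s=1}^{t} v_s \\
&= z_0 + b_0 t + b_1 \frac{t(t+1)}{2} + C(1) s_t + C^{*}(B)(v_t - v_0) \\
&= z_0^{*} + b_0 t + b_1 \frac{t(t+1)}{2} + C(1) s_t + C^{*}(B) v_t
\end{aligned}
\tag{9.26}
$$

where

$$
z_0^{*} = z_0 - C^{*}(B) v_0, \qquad s_t = \sum_{s=1}^{t} v_s
$$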

The inclusion of a linear trend in the VAR (9.24) implies a quadratic trend
in the levels equation (9.26). Furthermore, since $b_1 = C(1)\mu_1$, this quadratic
trend will disappear only if $C(1) = 0$. Recall that $C(1)\Pi(1) = 0$, so that
$C(1) = 0$ requires that $\Pi(1) = -A \neq 0$. This will be the case only if $\Pi(B)$ does
not contain the factor $(1 - B)$, i.e. that $z_t$ is I(0), which has been ruled out by
assumption and implies that A is of full rank n. If $\Pi(1) = 0$, so that $A = 0$, is
of rank zero and contains n unit roots, then there is no cointegration and
$C(1)$, and hence $b_1$, are unconstrained. In the general case, where the rank of
A is r, it then follows that the rank of $C(1)$ is $n - r$ (see Banerjee et al., 1993,
chap. 5.3.1). The rank of $b_1$, and hence the number of independent quadratic
deterministic trends, is thus also equal to $n - r$, and will therefore decrease as
the cointegrating rank r increases. Without the restriction on the trend
coefficient $b_1$, the solution (9.26) will have the property that the nature of the
trend in $z_t$ will vary with the number of cointegrating vectors.

To avoid this unsatisfactory outcome, the restriction $b_1 = C(1)\mu_1 = 0$ may
be imposed, in which case the solution for $z_t$ will contain only linear trends,
irrespective of the value of r. The choice of r then determines the split
between the number of independent linear deterministic trends, r, and the
number of stochastic trends, $n - r$, in the model.
$C(1)$ can be shown (see, for example, Banerjee et al., 1993, chap. 5.3.1) to
have the representation

$$
C(1) = \alpha_{\perp} \left( \beta_{\perp}^{\top} (I_n - \Phi(1)) \alpha_{\perp} \right)^{-1} \beta_{\perp}^{\top}
$$

so that the cointegrating vectors $\alpha^{\top} z_t$ have a linear but not a quadratic trend:
since $\alpha^{\top}\alpha_{\perp} = 0$, $\alpha^{\top} C(1) = 0$ and

$$
\alpha^{\top} z_t = \alpha^{\top} z_0^{*} + \alpha^{\top} C^{*}(1) \mu_1 t + \alpha^{\top} C^{*}(B) v_t \tag{9.27}
$$

Note also that

$$
\begin{aligned}
C(1)\mu_1 &= \alpha_{\perp} \left( \beta_{\perp}^{\top} (I_n - \Phi(1)) \alpha_{\perp} \right)^{-1} \beta_{\perp}^{\top} \mu_1 \\
&= \alpha_{\perp} \left( \beta_{\perp}^{\top} (I_n - \Phi(1)) \alpha_{\perp} \right)^{-1} \beta_{\perp}^{\top} \beta \gamma_1 = 0
\end{aligned}
$$

so that $b_1 = 0$ in (9.26) and $\mu_1 = \beta\gamma_1$ in (9.24) are equivalent restrictions.
This restriction may be imposed by setting $\mu_1 = Ac$, where c is an $n \times 1$ vector
of unknown coefficients. In this case $b_1 = C(1)Ac = -C(1)\Pi(1)c = 0$ in (9.26).
Furthermore, since $C^{*}(1)A = I_n$ (see Pesaran and Shin, 2002),
$\alpha^{\top} C^{*}(1)\mu_1 = \alpha^{\top} C^{*}(1)Ac = \alpha^{\top} c$, so that (9.27) becomes

$$
\alpha^{\top} z_t = \alpha^{\top} z_0^{*} + \alpha^{\top} c\, t + \alpha^{\top} C^{*}(B) v_t
$$

The cointegrating vectors will not contain linear trends if $\alpha^{\top} c = 0$, and these
are known as the 'co-trending' restrictions.


9.5.3 Estimation of VECMs and tests of the cointegrating rank 


ML estimation of the VECM (9.22) is discussed in many texts: see, for
example, Banerjee et al. (1993, chap. 8.2), Hamilton (1994, chap. 20.2) and
Johansen (1995, chap. 6), and routines are available in most econometrics
packages. Without going into unnecessary details, ML estimates are obtained
in the following way. Consider (9.22) written as

$$
\Delta z_t = \mu + \sum_{i=1}^{m-1} \Phi_i \Delta z_{t-i} + \beta\alpha^{\top} z_{t-1} + v_t \tag{9.28}
$$

The first step is to estimate (9.28) under the restriction $\beta\alpha^{\top} = 0$. As this is
simply a VAR(m − 1) in $\Delta z_t$, OLS estimation will yield the set of residuals $\hat{v}_t$,
from which is calculated the sample covariance matrix

$$
S_{00} = T^{-1} \sum_{t=1}^{T} \hat{v}_t \hat{v}_t^{\top}
$$

The second step is to estimate the multivariate regression

$$
z_{t-1} = \kappa + \sum_{i=1}^{m-1} \Theta_i \Delta z_{t-i} + u_t
$$

and use the OLS residuals $\hat{u}_t$ to calculate the covariance matrices

$$
S_{11} = T^{-1} \sum_{t=1}^{T} \hat{u}_t \hat{u}_t^{\top}
$$

and

$$
S_{10} = T^{-1} \sum_{t=1}^{T} \hat{u}_t \hat{v}_t^{\top} = S_{01}^{\top}
$$

In effect, these two regressions partial out the effects of $(\Delta z_{t-1}, \ldots,
\Delta z_{t-m+1})$ from $\Delta z_t$ and $z_{t-1}$, leaving us to concentrate on the relationship
between $\Delta z_t$ and $z_{t-1}$, which is parameterised by $\beta\alpha^{\top}$. $\alpha$ is then estimated by
the r linear combinations of $z_{t-1}$ that have the largest squared partial correlations
with $\Delta z_t$; this is known as a reduced-rank regression. More precisely,
this procedure maximises the likelihood of (9.28) by solving a set of
equations of the form

$$
\left( \lambda_i S_{11} - S_{10} S_{00}^{-1} S_{01} \right) \upsilon_i = 0 \tag{9.29}
$$

where $\lambda_1 > \lambda_2 > \cdots > \lambda_n$ are the set of eigenvalues and $V = (\upsilon_1, \upsilon_2, \ldots, \upsilon_n)$
is the set of associated eigenvectors, subject to the normalisation

$$
V^{\top} S_{11} V = I_n
$$

The ML estimate of $\alpha$ is then given by the eigenvectors corresponding to
the r largest eigenvalues,

$$
\hat\alpha = (\upsilon_1, \upsilon_2, \ldots, \upsilon_r)
$$

and the ML estimate of $\beta$ is then given by

$$
\hat\beta = S_{01} \hat\alpha
$$

which is equivalent to the estimate of $\beta$ that would be obtained by substituting
$\hat\alpha$ into (9.28) and estimating by OLS, which also provides ML
estimates of the remaining parameters in the model.
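
The two auxiliary regressions and the eigenvalue problem (9.29) can be sketched directly with NumPy. The function name `reduced_rank_regression`, the use of a Cholesky factor of $S_{11}$ to convert (9.29) into a symmetric eigenvalue problem, and the simulated bivariate data are all assumptions of this illustration; it covers only the case of an unrestricted constant and is not a substitute for the routines in established econometrics packages.

```python
import numpy as np

def reduced_rank_regression(z, m):
    """Eigenvalues and S11-normalised eigenvectors of (9.29) for a VAR(m) in levels z."""
    z = np.asarray(z, dtype=float)
    n_obs, _ = z.shape
    dz = np.diff(z, axis=0)                        # dz[k] = z[k+1] - z[k]
    idx = np.arange(m, n_obs)                      # dates t with all lags available
    d0 = dz[idx - 1]                               # Delta z_t
    l1 = z[idx - 1]                                # z_{t-1}
    X = np.column_stack([np.ones(len(idx))] +
                        [dz[idx - 1 - i] for i in range(1, m)])   # constant, lagged differences
    v = d0 - X @ np.linalg.lstsq(X, d0, rcond=None)[0]            # first-step residuals
    u = l1 - X @ np.linalg.lstsq(X, l1, rcond=None)[0]            # second-step residuals
    T = len(idx)
    S00, S11, S01 = v.T @ v / T, u.T @ u / T, v.T @ u / T
    L_inv = np.linalg.inv(np.linalg.cholesky(S11))                # S11 = L L'
    M = L_inv @ S01.T @ np.linalg.solve(S00, S01) @ L_inv.T       # symmetric form of (9.29)
    lam, W = np.linalg.eigh(M)
    order = np.argsort(lam)[::-1]                                 # largest eigenvalue first
    return lam[order], (L_inv.T @ W)[:, order], S00, S01, S11, T

# Hypothetical cointegrated pair: y_t = x_t + u_t with u_t a stationary AR(1).
rng = np.random.default_rng(0)
n_sim = 1000
x_sim = np.cumsum(rng.standard_normal(n_sim))
u_sim = np.zeros(n_sim)
eps = rng.standard_normal(n_sim)
for t in range(1, n_sim):
    u_sim[t] = 0.5 * u_sim[t - 1] + eps[t]
z_sim = np.column_stack([x_sim + u_sim, x_sim])

lam, V, S00, S01, S11, T_eff = reduced_rank_regression(z_sim, m=2)
r = 1
alpha_hat = V[:, :r]              # estimated cointegrating vector(s)
beta_hat = S01 @ alpha_hat        # estimated adjustment coefficients
print(lam, alpha_hat[:, 0] / alpha_hat[0, 0])
```

In this simulated example the normalised cointegrating vector should be close to $(1, -1)$, the relationship built into the data.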

This procedure can be straightforwardly adapted when a linear trend is
included in (9.28) and when the various restrictions are placed upon the
intercept and trend coefficients. This involves adjusting the first- and
second-step regressions to accommodate the alterations (Pesaran and
Pesaran, 1997, chap. 19.7, conveniently list the alternative set-ups).

Of course, ML estimation is based upon a known value of the cointegrating
rank, r, and in practice this value will be unknown. Fortunately, the
set of equations (9.29) also provides a method of determining the value of r.
If $r = n$ and A is unrestricted, the maximised log-likelihood is given by
Banerjee et al. (1993, chap. 9.3):

$$
L(n) = K - (T/2) \sum_{i=1}^{n} \log(1 - \lambda_i)
$$

where $K = -(T/2)(n(1 + \log 2\pi) + \log|S_{00}|)$. For a given value of $r < n$,
only the first r eigenvalues should be positive, and the restricted log-likelihood
is

$$
L(r) = K - (T/2) \sum_{i=1}^{r} \log(1 - \lambda_i)
$$

A likelihood ratio test of the hypothesis that there are r cointegration vectors
against the alternative that there are n is thus given by

$$
\eta_r = 2(L(n) - L(r)) = -T \sum_{i=r+1}^{n} \log(1 - \lambda_i)
$$

This is known as the trace statistic, and testing proceeds in the sequence $\eta_0,
\eta_1, \ldots, \eta_{n-1}$. A cointegrating rank of r is selected if the last significant
statistic is $\eta_{r-1}$, which thereby rejects the hypothesis of $n - r + 1$ unit roots
in A. The trace statistic measures the 'importance' of the adjustment coefficients
$\beta$ on the eigenvectors to be potentially omitted.

An alternative test of the significance of the largest eigenvalue is

$$
\zeta_r = -T \log(1 - \lambda_{r+1}), \qquad r = 0, 1, \ldots, n - 1
$$

which is known as the maximal-eigenvalue or $\lambda$-max statistic. Both $\eta_r$ and $\zeta_r$
have non-standard limiting distributions that are functionals of multivariate
Brownian motions, and are generalisations of the Dickey-Fuller distributions
discussed in chapter 3. Although there are no analytical forms for the distributions,
critical values can be obtained by Monte Carlo simulation. The
limiting distributions depend on n and on the restrictions imposed on the
behaviour of the trends appearing in the VECM. For example, if $\mu$ in (9.28) is
replaced by $\mu_0 + \mu_1 t$, then the ML estimation and testing procedures outlined
above need to be amended to take into account both the presence of a linear
trend and the various possible restrictions that could be placed on $\mu_0$ and $\mu_1$
(see, for example, Johansen, 1995, chaps. 6 and 15, for extended discussion).
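
Given a set of eigenvalues, both sequences of test statistics follow from a few lines of arithmetic, as the sketch below shows. The function name `rank_test_statistics` and the eigenvalues and sample size used are hypothetical; the appropriate critical values must be taken from published tables for the relevant treatment of the constant and trend.

```python
import numpy as np

def rank_test_statistics(eigenvalues, T):
    """Return (trace, max_eig), where trace[r] = eta_r and max_eig[r] = zeta_r."""
    lam = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]   # lambda_1 >= ... >= lambda_n
    max_eig = -T * np.log(1.0 - lam)                            # zeta_r uses lambda_{r+1}
    trace = np.cumsum(max_eig[::-1])[::-1]                      # eta_r sums over i > r
    return trace, max_eig

lam_example = [0.20, 0.05, 0.01]          # hypothetical eigenvalues, n = 3
trace, max_eig = rank_test_statistics(lam_example, T=200)
for r_null, (eta, zeta) in enumerate(zip(trace, max_eig)):
    print(r_null, round(eta, 2), round(zeta, 2))
```

The two statistics coincide for $r = n - 1$, and successive trace statistics differ by exactly the corresponding $\lambda$-max statistic.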
For this modelling framework to become operational, we have to determine
the lag order m, the trend order l and the reduced (cointegrating) rank r. By
'trend order' we mean that if $l = 1$ then the linear trend model is appropriate, if
$l = 0$ then only a constant is included, while if not even a constant is required
we set $l = -1$ by convention. Typically, m and l are first determined using
either an information criterion or a sequence of likelihood ratio or Wald tests,
and, conditional on these settings, r is then determined by the sequence of
trace or $\lambda$-max tests. In empirical applications, however, the choice of r is
frequently sensitive to the choice of m and l and the trend restrictions.

A further problem is that the trace and $\lambda$-max tests rely on critical values
drawn from limiting distributions, and these have been shown to have rather 
unreliable finite sample performance: see, inter alia, Reimers (1992), Haug 
(1996), Ho and Sørensen (1996) and Toda (1994, 1995). Small-sample 
corrections to the rank tests have been proposed by Johansen (2002a, 2002b), 
while Johansen (2006) summarises recent developments in the analysis of 
cointegrated systems. 

Given these complications, it is appealing to consider whether we can 
select jointly the cointegrating rank, the lag length and the appropriate 
restriction on the trend component. One possibility for doing this is to use 
information criteria to select between all possible models, as suggested by 
Lütkepohl (1991, chap. 11.4); see also Mills (1998). For example, if we denote
the set of models to be considered as VECM(m, l, r), we could select that 
model that minimises BIC(m, l, r) as defined in chapter 8, section 7. 


Example 9.3 Cointegration in the UK financial markets 
In example 8.8 we analysed the vector $(\Delta p_t, \Delta d_t, \Delta rs_t, \Delta r20_t)$ by implicitly
assuming that there was no cointegration between the series. We now
investigate whether the appropriate relationship between these four series is,
in fact, a VECM in $z_t = (p_t, d_t, rs_t, r20_t)$, although we do this using a shorter
sample period, January 1969 to December 2000. With T thus equal to 384, we
follow Saikkonen and Lütkepohl (1996) and set the maximum order of m to
be considered as the integer part of $T^{1/3}$, i.e. we set $m = 7$.

For all choices of l in the range $-1 \leq l \leq 1$, the BIC (and, indeed, various
other information criteria) selected $m = 2$. Since the BIC values for this lag
order and alternative settings of l were very close, we decided to work with
the most general trend setting and therefore set $l = 1$, so that a linear trend
was included in the VAR.

Table 9.2 Cointegrating rank test statistics

r        $\eta_r$      $\lambda_r$      $\eta_{r,0.05}$

(a) Trend and constant in cointegrating vector: $\mu_1 = \beta\gamma_1$

= 0      63.17    0.0852    47.21
≤ 1      29.68    0.0350    29.98
≤ 2      15.31    0.0294    15.41
≤ 3       3.84    0.0100     3.76

(b) Constant only in cointegrating vector: $\mu_1 = 0$, $\mu_0 = \beta\gamma_0$

= 0      69.98    0.0962    62.99
≤ 1      31.12    0.0356    42.44
≤ 2      19.19    0.0321    25.32
≤ 3       4.68    0.0121    12.25


Table 9.2 presents the sequence of trace statistics and associated eigen- 
values conditional upon m= 2 and l= 1. Employing a 5 per cent significance 
level, this suggests that r= 1, irrespective of whether the cointegrating vector 
contains both a trend and a constant or just a constant. We should note, 
however, that the statistic testing the null r=1 for the former case is very 
close to its 5 per cent critical value, and, since Banerjee et al. (1993, chap. 
9.5.3) warn against omitting cointegrating vectors in these circumstances, 
this perhaps points in favour of setting r=2. Nevertheless, in either case 
there is clear evidence of cointegration, implying that using a VAR in the first 
differences to model z, constitutes a misspecification. 

With $r = 2$ and a trend included in the cointegrating vector, the ML
estimation procedure produces

$$
\hat\alpha^{\top} = \begin{bmatrix} 7.878 & -7.750 & 0.620 & -6.789 \\ 2.442 & -2.381 & -2.577 & 1.633 \end{bmatrix}
$$

but, as noted earlier, these estimates are not unique, so that the question of
how they can be interpreted arises. This is now discussed within the context
of identification of VECMs.


9.5.4 Identification of VECMs 


The assumption that the rank of A is r implicitly imposes (n — r}? restrictions 
on its n? coefficients, leaving n° — (n— r)?=2nr— f free parameters. 
The two nx r matrices a and f involve 2nr parameters, so that identifying 
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A= fa' requires a total of r restrictions. If the identifying restrictions are 
imposed only on 4, if they are linear and if there are no cross-cointegrating 
vector restrictions, then the restrictions can be written for the ith coin- 
tegrating vector as 


Ra; = 4a; (9.30) 


where R; and a; are an rxn matrix and an rx1 vector, respectively. 
A necessary and sufficient condition for a to be uniquely identified is that the 
rank of each R;a; is r, while the necessary condition is that there must be r 
restrictions placed on each of the r cointegrating vectors. The more general 
case of non-linear and cross-vector restrictions is discussed in Pesaran and 
Shin (2002). Note that identification of a, and hence A, is achieved solely 
through restrictions on a itself: long-run relationships cannot be identified 
through restrictions on the short-run dynamics — i.e. the ®; coefficients in 
(9.28) can be estimated freely. 

If the number of restrictions that are imposed on $\alpha$ is $k$, then $k = r^2$
constitutes exact identification. The imposition of $r$ restrictions on each of the
$r$ cointegrating vectors does not alter the likelihood $L(r)$, so that, while their
imposition enables a unique estimate of $\alpha$ to be obtained, the validity of the
restrictions cannot be tested. Typically $r$ restrictions are obtained by
normalisation, and if $r = 1$ then this is all that is required. For $r > 1$, a further
$r^2 - r$ restrictions are required ($r - 1$ on each equation), and this forms the
basis for Phillips' (1991) triangular representation. This writes $\alpha$ as

\[
\alpha^{\top} = [\, I_r \;\; -\Gamma \,]
\]

where $\Gamma$ is an $r \times (n-r)$ matrix. The $r^2$ just-identifying restrictions are thus
made up of $r$ normalisations and $r^2 - r$ zero restrictions, corresponding to
solving $\alpha^{\top}z_t$ for the first $r$ components of $z_t$.
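
As an illustration of the triangular normalisation, the sketch below rotates an unnormalised estimate of $\alpha^{\top}$ into the form $[I_r \; -\Gamma]$ by pre-multiplying by the inverse of its leading $r \times r$ block. The helper name is hypothetical, and the input is simply the unnormalised matrix reported at the end of example 9.3; the result is the triangular form only, not the economically motivated restrictions used in example 9.4.

```python
import numpy as np


def triangular_normalise(alpha_t: np.ndarray):
    """Rotate an r x n matrix alpha' into [I_r, -Gamma].

    Assumes the leading r x r block of alpha' is invertible.
    """
    r = alpha_t.shape[0]
    lead = alpha_t[:, :r]
    normalised = np.linalg.solve(lead, alpha_t)   # lead^{-1} alpha'
    gamma = -normalised[:, r:]                    # so that alpha' = [I_r, -Gamma]
    return normalised, gamma


# Unnormalised ML estimate of alpha' from example 9.3 (r = 2, n = 4)
alpha_hat_t = np.array([[7.878, -7.750, 0.620, -6.789],
                        [2.442, -2.381, -2.577, 1.633]])
normalised, gamma = triangular_normalise(alpha_hat_t)
print(np.round(normalised, 3))
```

Because the rotation is an invertible transformation of the rows, the space spanned by the cointegrating vectors, and hence the likelihood, is unchanged.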

When $k > r^2$, there are $k - r^2$ overidentifying restrictions. ML estimation
subject to the restrictions (9.30) is discussed in, for example, Pesaran and
Pesaran (1997). If $L(r:p)$ denotes the log-likelihood after the imposition of
the $p = k - r^2$ overidentifying restrictions, then the validity of these
restrictions can be tested using the likelihood ratio statistic

\[
2\,\bigl(L(r) - L(r:p)\bigr) \overset{a}{\sim} \chi^2_p
\]

Restrictions can also be imposed on $\beta$, and may link both $\alpha$ and $\beta$. The
identification, estimation and testing of very general sets of restrictions is discussed
in Hendry and Doornik (2006) and programmed in their PcFiml 11.0.


9.5.5 Exogeneity in VECMs


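In the previous subsection we considered hypotheses about the cointegrating
matrix $\alpha$. We now consider hypotheses concerning the adjustment factors
$\beta$. Suppose, as in chapter 8, section 6, we again make the partition
$z_t = (y_t^{\top}, x_t^{\top})^{\top}$ and now write the VECM as

\begin{align}
\Delta y_t &= \sum_{i=1}^{m-1} \Phi_{1i}\,\Delta z_{t-i} + \beta_1\alpha^{\top}z_{t-1} + T_1 + v_{1t} \tag{9.31} \\
\Delta x_t &= \sum_{i=1}^{m-1} \Phi_{2i}\,\Delta z_{t-i} + \beta_2\alpha^{\top}z_{t-1} + T_2 + v_{2t} \tag{9.32}
\end{align}

where

\[
\Phi_i = \begin{bmatrix} \Phi_{1i} \\ \Phi_{2i} \end{bmatrix}, \; i = 1, \ldots, m-1, \qquad
\beta = \begin{bmatrix} \beta_1 \\ \beta_2 \end{bmatrix}, \qquad
T_j = \mu_{0j} + \mu_{1j}t, \; j = 1, 2
\]

and where $v_t$ and its covariance matrix $\Sigma_v$ are partitioned as in chapter 8,
section 7.2. Pre-multiplying (9.32) by $\omega = \Sigma_{12}\Sigma_{22}^{-1}$ and subtracting the result
from (9.31) yields the conditional model for $\Delta y_t$

\[
\Delta y_t = \omega\,\Delta x_t + \sum_{i=1}^{m-1} \tilde{\Phi}_{1i}\,\Delta z_{t-i} + (\beta_1 - \omega\beta_2)\,\alpha^{\top}z_{t-1} + \tilde{T}_1 + \tilde{v}_{1t} \tag{9.33}
\]

where $\tilde{\Phi}_{1i} = \Phi_{1i} - \omega\Phi_{2i}$, $\tilde{T}_1 = T_1 - \omega T_2$ and $\tilde{v}_{1t} = v_{1t} - \omega v_{2t}$ with
covariance matrix $\Sigma_{11.2} = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}$. $\alpha$ enters both the conditional
model (9.33) and the marginal model (9.32) unless $\beta_2 = 0$. This is the
condition for $x_t$ to be weakly exogenous for $(\alpha, \beta_1)$, in which case the ML
estimates of these parameters can be calculated from the conditional model
alone (Johansen, 1995, theorem 9.1).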
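
The role of $\beta_2$ in the conditional model can be seen numerically. The sketch below computes $\omega = \Sigma_{12}\Sigma_{22}^{-1}$ and the conditional loading $\beta_1 - \omega\beta_2$ that multiplies $\alpha^{\top}z_{t-1}$ in (9.33). All numerical values and the function name are hypothetical, chosen only for illustration (two $y$ variables, one $x$ variable, one cointegrating vector).

```python
import numpy as np


def conditional_loadings(beta1, beta2, sigma12, sigma22):
    """Return omega and the loading (beta1 - omega beta2) of the conditional model."""
    omega = sigma12 @ np.linalg.inv(sigma22)
    return omega, beta1 - omega @ beta2


# Hypothetical values, for illustration only
beta1 = np.array([[-0.20], [0.10]])     # loadings in the y equations
beta2 = np.array([[0.05]])              # loading in the x equation
sigma12 = np.array([[0.30], [-0.10]])   # cov(v1, v2)
sigma22 = np.array([[2.00]])            # var(v2)

omega, cond = conditional_loadings(beta1, beta2, sigma12, sigma22)

# Under weak exogeneity (beta2 = 0) the conditional loading is just beta1
_, cond_exog = conditional_loadings(beta1, np.zeros_like(beta2), sigma12, sigma22)
print(cond.ravel(), cond_exog.ravel())
```

When $\beta_2 \neq 0$ the conditional loading mixes $\beta_1$ and $\beta_2$, so the marginal model for $\Delta x_t$ carries information about the long-run parameters; when $\beta_2 = 0$ it does not.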


Example 9.4 Identifying the cointegrating vectors and testing 

for weak exogeneity 
Given that we have found two cointegrating vectors in example 9.3, we now
wish to identify them uniquely and, in so doing, see if we can provide them
with an economic/financial interpretation. The estimates contained in the
first row of $\hat{\alpha}^{\top}$ suggest that the just-identifying restrictions $\alpha_{11} = 1$ and $\alpha_{13} = 0$
should be imposed, while the second row of $\hat{\alpha}^{\top}$ suggests the restrictions
$\alpha_{21} = \alpha_{24} = -1$ and $\alpha_{22} = \alpha_{23} = 1$, i.e. on estimation

\begin{align*}
e^*_{1t} &= p_t - d_t + r20_t - 5.548 + 0.0004t \\
e^*_{2t} &= -p_t + d_t + rs_t - r20_t + 3.491 - 0.0006t
\end{align*}

A likelihood ratio test of the set of overidentifying restrictions produces the
statistic $\chi^2_4 = 4.71$, with a p-value of 0.32, so that the set of restrictions is
accepted.
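
The p-value quoted for the overidentification test can be reproduced directly. The sketch below assumes four degrees of freedom (an assumption, though it is the value that reproduces the reported p-value) and uses the closed form of the $\chi^2$ survival function for even degrees of freedom, so that only the standard library is needed; the function name is illustrative.

```python
import math


def chi2_sf_even_df(x: float, df: int) -> float:
    """Upper-tail probability P(X > x) for a chi-squared variate with even df."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("closed form used here requires a positive, even df")
    half = x / 2.0
    # exp(-x/2) * sum_{j < df/2} (x/2)^j / j!
    return math.exp(-half) * sum(half ** j / math.factorial(j) for j in range(df // 2))


p_overid = chi2_sf_even_df(4.71, 4)
print(round(p_overid, 2))
```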

Mills (1991a) discusses the equilibrium relationship often thought to hold 
between the equity and gilt markets: that the gilt and dividend yields should 
be in constant proportion to each other. This is exactly the form taken by $e^*_{1t}$,
which, in terms of the levels $P_t$, $D_t$ and $R20_t$, implies that the ratio

\[
R20_t / (D_t / P_t)
\]

is stationary. Since $D_t/P_t$ is the dividend yield, in equilibrium the gilt yield
and the dividend yield are in constant proportion to each other. Since 
deviations from this equilibrium are stationary, divergences from this ratio 
can only be temporary. This ratio was, in fact, exactly the decomposition 
used by investment analysts of the 1950s and early 1960s to analyse move- 
ments in equity prices, and it was termed by them the ‘confidence factor’ (see 
Mills, 1991a). 

The second error correction implies that the interest rate 'spread' $R20_t/RS_t$
is directly proportional to the dividend yield $D_t/P_t$. Figure 9.13 plots the two
error corrections, and both are seen to be stationary. Note that extreme 
values of the confidence factor are observed in 1975 and 1987, both periods 
of great upheaval in the UK equity market, but even here there is a marked 
tendency to move back towards equilibrium. 

The estimated 'loading factor' or 'adjustment' matrix is, with standard
errors shown in parentheses,

\[
\hat{\beta} = \begin{bmatrix}
\underset{(0.024)}{-0.080} & \underset{(0.009)}{0.011} \\
\underset{(0.006)}{0.027} & \underset{(0.002)}{0.005} \\
\underset{(0.023)}{0.014} & \underset{(0.008)}{-0.025} \\
\underset{(0.014)}{-0.003} & \underset{(0.005)}{-0.002}
\end{bmatrix}
\]

The coefficients in the fourth row are individually insignificant, while at least
one of the coefficients in each of the first three rows is significant, so it is
possible that $\beta$ can be partitioned as $\beta = [\beta_1 \;\; 0]^{\top}$ and the VECM can be
partitioned as (9.31)/(9.32). Indeed, a test of $\beta_{41} = \beta_{42} = 0$ cannot reject this
joint null hypothesis (the statistic is $\chi^2_2 = 0.19$), so r20 appears to be weakly
exogenous with respect to $\alpha$ and $\beta_1, \beta_2, \beta_3$.
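
The pattern of individual significance described above can be read off the ratios of the reported coefficients to their standard errors. The sketch below does this with the rounded figures from the text, so the ratios differ slightly from the t-statistics in table 9.3; the 1.96 cut-off is an assumed conventional 5 per cent two-sided value, and these individual ratios are not a substitute for the joint test reported in the example.

```python
import numpy as np

# Estimated adjustment matrix and standard errors, as reported in example 9.4
beta_hat = np.array([[-0.080,  0.011],
                     [ 0.027,  0.005],
                     [ 0.014, -0.025],
                     [-0.003, -0.002]])
std_err = np.array([[0.024, 0.009],
                    [0.006, 0.002],
                    [0.023, 0.008],
                    [0.014, 0.005]])

t_ratios = beta_hat / std_err
significant = np.abs(t_ratios) > 1.96   # assumed 5 per cent two-sided cut-off

for name, row in zip(["p", "d", "rs", "r20"], np.round(t_ratios, 2)):
    print(name, row)
```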


9.5.6 Structural VECMs


Following Johansen and Juselius (1994), a 'structural VECM' may be written
as (abstracting from deterministic components)

\[
\Gamma_0\,\Delta z_t = \sum_{i=1}^{m-1} \Gamma_i\,\Delta z_{t-i} + \Theta\alpha^{\top}z_{t-1} + \nu_t \tag{9.34}
\]

which is related to the 'reduced-form' VECM

\[
\Delta z_t = \sum_{i=1}^{m-1} \Phi_i\,\Delta z_{t-i} + \beta\alpha^{\top}z_{t-1} + v_t \tag{9.35}
\]

through

\[
\Gamma_i = \Gamma_0\Phi_i, \qquad \Gamma_0\beta = \Theta, \qquad \nu_t = \Gamma_0 v_t
\]

so that

\[
E(\nu_t\nu_t^{\top}) = \Sigma_{\nu} = \Gamma_0\Sigma_v\Gamma_0^{\top}
\]

Note that in this framework we are assuming that the cointegrating vectors
have already been identified, so that identification of the 'short-run'
structure, the parameters $\Gamma_0, \Gamma_1, \ldots, \Gamma_{m-1}, \Theta$, is carried out conditionally upon
the form of $\alpha$. This can be done using conventional methods and will
typically proceed in an exploratory fashion, as little is usually known a priori
about the short-run structure (see Johansen and Juselius, 1994).
Identification in the traditional econometric simultaneous-equations
framework requires an a priori partitioning of the variables into endogenous
and exogenous categories, i.e. as $z_t = (y_t^{\top}, x_t^{\top})^{\top}$. Wickens (1996) and Hsiao
(1997) analyse the identification of such cointegrated simultaneous-equation
systems. Their analysis highlights the difference between the two approaches.
In the structural approach, the presence or absence of cointegration is pre- 
sumed in the structure of the model, along with the endogenous/exogenous 
classification of the variables. In the ‘time series’ approach, hypotheses about 
cointegration and exogeneity are determined by the data, so that, although 
less prior information is needed than in the structural approach, the data are 
required to be more informative so as to allow reliable inferences to be made. 


Example 9.5 An estimated structural VECM 
Table 9.3 presents the unrestricted ML estimates of the VECM(2,1,2) model
selected in example 9.3, although the model is estimated conditionally upon
the identified cointegrating vectors of example 9.4. The correlation matrix R
suggests that there are contemporaneous relationships holding between $\Delta p_t$
and $\Delta r20_t$, and between $\Delta r20_t$ and $\Delta rs_t$.

Table 9.3 Unrestricted estimates of VECM(2,1,2) model

                          Φ1                                  β
          Δp(t-1)   Δd(t-1)   Δrs(t-1)  Δr20(t-1)   e*1(t-1)  e*2(t-1)   Constant    σ̂

Δp(t)      0.149     0.150     0.073    -0.154      -0.080     0.011      0.0049    5.76%
          [2.74]    [0.80]    [1.33]    [1.58]      [3.37]    [1.26]      [1.52]

Δd(t)     -0.028    -0.013     0.007     0.009       0.027     0.005      0.0071    1.53%
          [1.95]    [0.26]    [0.46]    [0.35]      [4.38]    [1.96]      [9.34]

Δrs(t)    -0.111     0.222     0.275     0.222       0.014    -0.025     -0.0007    5.58%
          [2.11]    [1.21]    [5.16]    [2.35]      [0.62]    [2.91]      [0.22]

Δr20(t)   -0.136     0.058    -0.028     0.209      -0.003    -0.002     -0.0007    3.36%
          [4.28]    [0.52]    [0.89]    [3.68]      [0.25]    [0.49]      [0.35]

\[
R = \begin{bmatrix} 1 & & & \\ 0.09 & 1 & & \\ 0.17 & -0.02 & 1 & \\ -0.27 & -0.02 & 0.44 & 1 \end{bmatrix},
\qquad |\hat{\Sigma}_v| = 2.00 \times 10^{-12}
\]

Note: R is the matrix of contemporaneous residual correlations.

By embodying these relationships,
the following estimated structural form was obtained by imposing various
data-acceptable coefficient restrictions (a likelihood ratio test of the eighteen
imposed restrictions yields a $\chi^2$ statistic of 12.19):


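\begin{align*}
\Delta p_t &= \underset{[1.99]}{0.0059} + \underset{[3.10]}{0.158}\,\Delta p_{t-1} - \underset{[3.60]}{0.083}\,e^*_{1,t-1}, & \hat{\sigma} &= 5.78\% \\
\Delta d_t &= \underset{[9.04]}{0.007} - \underset{[2.33]}{0.031}\,\Delta p_{t-1} + \underset{[4.61]}{0.028}\,e^*_{1,t-1} + \underset{[1.99]}{0.005}\,e^*_{2,t-1}, & \hat{\sigma} &= 1.52\% \\
\Delta rs_t &= \underset{[10.30]}{0.747}\,\Delta r20_t + 0.320\,\Delta rs_{t-1} - 0.022\,e^*_{2,t-1}, & \hat{\sigma} &= 5.68\% \\
\Delta r20_t &= -\underset{[5.46]}{0.153}\,\Delta p_t - \underset{[4.05]}{0.118}\,\Delta p_{t-1} + \underset{[3.31]}{0.162}\,\Delta r20_{t-1}, & \hat{\sigma} &= 3.22\%
\end{align*}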


The figures in [ ] are t-statistics. The error corrections appear in all but the
$\Delta r20$ equation, but, as this contains $\Delta p_{t-1}$ as a regressor, no variable is
weakly exogenous for the parameters in any of the four equations. Indeed,
$\Delta p_{t-1}$ appearing in this and the $\Delta d_t$ equation are the only cases when a
lagged variable, other than a dependent variable, appears in an equation, thus
demonstrating the importance of the cointegration framework in estab- 
lishing the presence of the error corrections: without the information con- 
tained in the cointegration properties of the data, only a small part of the 
variation of the data would have been explained, and few interesting regu- 
larities would have been uncovered. 


9.6 Causality testing in VECMs


Tests of hypotheses about $\alpha$ and $\beta$ also appear when questions of causality
arise in VECMs. Consider again the partition used in chapter 8, section 8.2,
$z_t = (y_t^{\top}, x_t^{\top}, r_t^{\top})^{\top}$, where the dimensions of the three vectors are $n_1$, $n_2$ and
$n_3 = n - n_1 - n_2$, and $\Phi_i$ and $A = \beta\alpha^{\top}$ are partitioned conformably. The null
hypothesis that $x$ does not Granger-cause $y$ can then be formulated as

\[
H_0: \Phi_{1,12} = \cdots = \Phi_{m-1,12} = 0, \quad A_{12} = 0
\]

where $\Phi_{i,12}$ and $A_{12}$ are appropriate $n_1 \times n_2$ submatrices of $\Phi_i$ and $A$,
respectively.

Causality tests are often constructed from the OLS estimates of the VAR,
however, which implicitly use an unrestricted estimate of $\Pi$. Toda and
Phillips (1993, 1994), by extending the analysis of Sims, Stock and Watson
(1990), conclude that, when cointegration is present (i.e. when $A_{12} = \beta_1\alpha_2^{\top}$,
where $\beta_1$ and $\alpha_2$ are conformable partitions of $\beta$ and $\alpha$), standard Wald tests
of causality constructed using an unrestricted estimate of $A$ are distributed
asymptotically as $\chi^2$ only if $\alpha_2$ is of rank $n_2$. If this rank condition fails, the
limit distribution involves a mixture of a $\chi^2$ and a non-standard distribution
that involves nuisance parameters. Unfortunately, since we require
knowledge of the cointegration properties of the system, which are not available
simply from estimation of the 'levels' VAR, there is no valid statistical basis
for ascertaining whether this rank condition actually holds.

If there is no cointegration, then the Wald statistic for causality again has a 
non-standard limit distribution, although in this case it is free of nuisance 
parameters, so that critical values can be tabulated conveniently. If it is 
known that the system is I(1) with no cointegration, so that $A = 0$, however,
then of course we have a VAR in the differences $\Delta z_t$, and causality tests in
such models do have $\chi^2$ distributions, for we are back in the framework of
chapter 8, section 7. Toda and Phillips (1993) argue that such tests are likely 
to have higher power than tests from the levels VAR as they take account of 
the unit root constraint A=0, while the latter tests contain redundant 
parameter restrictions. 

When we have cointegration, causality tests should optimally be con- 
structed from the VECM, in which we know the value of the cointegrating 
rank $r$. In such models, it is often natural to refer to the first half of the
hypothesis $H_0$ as 'short-run non-causality' and the second half as 'long-run
non-causality'. It is testing for long-run non-causality in VECMs that gives
rise to difficulties. Toda and Phillips (1993, theorem 3) show that the
standard Wald statistic for testing $H_0$ will have an asymptotically valid $\chi^2$
distribution only if either the rank of $\alpha_2$ is $n_2$ or the rank of $\beta_1$ is $n_1$, in which
case the statistic will be asymptotically distributed as $\chi^2_{n_1 n_2 m}$.

Before we can apply these conventional $\chi^2$ asymptotics, we need to test
whether either of the two rank conditions actually holds. This can be done 
using the ML estimates of these matrices, after which causality tests can then be 
carried out. The Wald statistics required are extremely difficult to construct 
and the testing sequence is complicated, however, as the papers by Toda and 
Phillips show. Because of the complexity of this procedure, and because it 
requires prior knowledge of r (which typically can be obtained only by pre- 
tests), it would be useful if alternative, simpler, strategies were available. 

A more straightforward procedure has been proposed by Toda and
Yamamoto (1995) (see also Saikkonen and Lütkepohl, 1996). Suppose we
consider the levels VAR(m) model again but now augment the order by one,
i.e. we fit a VAR(m + 1). The non-causality hypothesis can now be tested by
a conventional Wald statistic, because the additional lag, for which $\Phi_{m+1} = 0$
by assumption, allows standard asymptotic inference to be used once again.
Under the assumption here that the elements of $z_t$ are at most I(1), the
inclusion of one additional lag in the estimated model suffices. For general
orders of integration, a VAR($m + d_{max}$) should be fitted, where $d_{max}$ is the
maximum order of integration of the components. It is thus not necessary to
know precisely the orders of integration or the cointegration rank.
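
The lag-augmentation idea can be sketched in a few lines. The code below is an illustrative implementation under simplifying assumptions, not the procedure used for the results reported in this chapter: it tests whether a single variable Granger-causes another, fits the relevant equation of a levels VAR($m + d_{max}$) by OLS with a constant, and computes the Wald statistic on the first $m$ lags of the causing variable only, leaving the extra lags unrestricted. The function name and the simulated data are hypothetical.

```python
import numpy as np


def la_var_wald(z, m, cause, effect, d_max=1):
    """Wald statistic for 'cause does not Granger-cause effect' in a levels VAR(m + d_max).

    z is a T x n array; only the first m lags of `cause` are restricted to zero.
    Returns the statistic and its chi-squared degrees of freedom (m here).
    """
    T, n = z.shape
    p = m + d_max
    # Regressors: constant, then z_{t-1}, ..., z_{t-p} stacked side by side
    X = np.column_stack([np.ones(T - p)] + [z[p - j:T - j] for j in range(1, p + 1)])
    y = z[p:, effect]
    xtx_inv = np.linalg.inv(X.T @ X)
    b = xtx_inv @ X.T @ y
    resid = y - X @ b
    s2 = resid @ resid / (len(y) - X.shape[1])
    # Columns holding lags 1..m of the causing variable
    idx = [1 + (j - 1) * n + cause for j in range(1, m + 1)]
    b_r = b[idx]
    cov_r = s2 * xtx_inv[np.ix_(idx, idx)]
    wald = float(b_r @ np.linalg.solve(cov_r, b_r))
    return wald, m


# Simulated I(1) data in which changes in x lead changes in y by one period
rng = np.random.default_rng(0)
T = 500
x = np.cumsum(rng.standard_normal(T))
e = rng.standard_normal(T)
y = np.zeros(T)
for t in range(2, T):
    y[t] = y[t - 1] + 0.5 * (x[t - 1] - x[t - 2]) + e[t]

z = np.column_stack([y, x])
wald_xy, df = la_var_wald(z, m=2, cause=1, effect=0)
print(round(wald_xy, 2), df)
```

The statistic would be compared with a $\chi^2$ critical value with `df` degrees of freedom (5.99 at the 5 per cent level when `df` is 2).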

It is not surprising, then, that this approach is less powerful than the Toda 
and Phillips approach, and it is also inefficient, as the order of the VAR is 
intentionally set too large (see the discussion in Stock, 1997). If the number 
of variables in the VAR is relatively small and the lag order is quite large, 
however, adding an additional lag might lead to only minor inefficiencies, 
while the pre-test biases associated with cointegration tests may be more 
serious. Given the ease with which the tests can be constructed, this ‘lag 
augmentation’ VAR (LA-VAR) approach should be seriously considered, 
particularly as Monte Carlo evidence presented by Yamada and Toda (1998) 
shows that it has excellent performance in terms of size stability when testing 
for Granger causality. 


Example 9.6 Causality tests using the LA-VAR approach 
Causality tests using a VAR(2) model were constructed using the LA-VAR
procedure. Since each series making up the VAR appears to be I(1), a VAR(3)
was actually fitted, leading to the causality test statistics shown in table 9.4.
There is strong evidence of causal patterns, except for those involving rs, which
does not cause any other variable and is caused only by r20. These statistics
therefore do not pick up the evidence of 'long-run' causality running from the
other variables to rs found in the structural VECM formulation of example 9.5.

Table 9.4 Granger causality tests using LA-VAR estimation

            p         d         rs        r20

p           -         7.85*     4.79      9.22*
d          13.58*     -         2.00     14.18*
rs          5.32      5.36      -         9.40*
r20        21.53*     9.42*     1.23      -


9.7 Impulse response asymptotics in non-stationary VARs 


As shown in chapter 8, section 7, the various impulse responses of the VAR 
are computed from the sequence of matrices 


\[
\Psi_i = \sum_{j=1}^{m} \Pi_j\Psi_{i-j}, \qquad \Psi_0 = I_n, \qquad \Psi_i = 0, \; i < 0
\]

Their computation remains exactly the same in non-stationary VARs but, if
$A = \sum_{j=1}^{m}\Pi_j - I$ is of reduced rank, the elements of $\Psi_i$ will not die out as $i$
increases, and this leads to some analytical complications. Following Phillips
(1998), we consider the behaviour of these impulse responses as the lead time
$i \to \infty$, and the asymptotic behaviour of estimates of these quantities as $T \to \infty$.
In stationary VARs, where all the roots of the long-run multiplier matrix $A$
lie outside the unit circle, the system's estimated impulse responses are
$\sqrt{T}$-consistent and, upon appropriate centring and scaling, have asymptotic
normal distributions (see Lütkepohl, 1991, chap. 3.7): as $i \to \infty$, both the $\Psi_i$
and their estimates $\hat{\Psi}_i$ tend to zero. For non-stationary VARs, where the $\Psi_i$
do not necessarily die out as $i \to \infty$, Phillips (1998) shows that a very
different limit theory holds for the impulse response estimates, which may be
summarised thus (see also Stock, 1996).

(i) When there are unit roots in the system, the long-horizon impulse 
responses estimated from a levels VAR by OLS are inconsistent, the 
limiting values of the estimated responses being random variables 
rather than the true impulse responses. The reason for this is that, 
because these true impulse responses do not die out as the lead time
increases, they carry the effects of the unit roots with them indefinitely.
Since the unit roots are estimated with error, the effects of the estimation
error persist in the limit as $T \to \infty$. The limiting distributions of $\hat{\Psi}_i$ as
$i \to \infty$ are asymmetric, so that confidence intervals for impulse responses
will be as well. 

(ii) The limiting impulse responses in a cointegrated VAR model are non- 
zero only in those directions where the model is non-stationary and has 
unit roots, i.e. $\alpha_{\perp}$. They are estimated consistently as long as the
cointegrating rank is either known or is itself consistently estimated, 
either by an order selection method or by using classical likelihood ratio 
tests that are suitably modified to ensure that the size of the test goes to 
zero as the sample size goes to infinity. This is because, in a reduced-rank 
regression, the matrix product $\beta\alpha^{\top}$ is estimated rather than $A$, so that no
unit roots are estimated (either explicitly or implicitly). Simulations 
reported by Phillips (1998) show that impulse responses are estimated 
accurately by such procedures. Nonetheless, these consistent selection 
procedures will tend to mistakenly take roots that are close to unity as 
actually being unity, so that, rather than dying out, they will converge to 
non-zero constants. Furthermore, as Stock (1996) shows, in these 
circumstances prediction intervals will be undesirably wide. 

It is clear from these results that impulse responses for non-stationary 
VARs should not be computed from an unrestricted levels VAR. Knowledge 
of the number of unit roots in the system is very important for obtaining 
accurate estimates, so it is important that the cointegrating rank is selected 
by a consistent method that works well in practice. 


Example 9.7 Impulse responses from the VECM 

The VECM(2, 1, 2) model arrived at in example 9.5 has an implied long-run
matrix, given by $\hat{\beta}\hat{\alpha}^{\top} + I$, that has two unit roots (given by the two
cointegrating vectors) and two real roots of 0.974 and 0.889. Consequently,
impulse responses converge to non-zero constants (rs to effectively zero), as
shown in figure 9.14(a). Of particular interest is the long-run effect of d on p,
which steadily accumulates over three years. This result has been remarked 
upon in Mills (1991a) as being consistent with the views of market profes- 
sionals who believe that financial factors have only a short-run impact on 
equity prices, with dividends being the long-run driving force. 

OLS estimation of the unrestricted VAR model provides the roots 1.004,
0.965, 0.934 and 0.880, and the impulse responses are shown in figure 9.14(b).
The presence of even a very marginal explosive root has a dramatic effect on
the impulse responses as the horizon increases, with all responses tending to 
get larger, particularly those associated with d. Thus, even though the long- 
run matrices are almost identical, estimating the two unit roots as 1.004 and 
0.965 produces major differences to the impulse responses. 
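
The recursion for the impulse response matrices given at the start of this section is straightforward to compute. The sketch below is a minimal implementation with an illustrative function name; the three first-order systems used to exercise it are hypothetical simplifications (they are not the four-variable models of this example) chosen to show a stationary root, an exact unit root and a marginally explosive root of 1.004.

```python
import numpy as np


def impulse_responses(pi_mats, horizon):
    """Psi_0, ..., Psi_horizon from Psi_i = sum_j Pi_j Psi_{i-j}, Psi_0 = I, Psi_i = 0 for i < 0."""
    n = pi_mats[0].shape[0]
    psi = [np.eye(n)]
    for i in range(1, horizon + 1):
        acc = np.zeros((n, n))
        for j, pi_j in enumerate(pi_mats, start=1):
            if i - j >= 0:          # Psi_{i-j} is zero for negative indices
                acc += pi_j @ psi[i - j]
        psi.append(acc)
    return psi


stationary = impulse_responses([0.5 * np.eye(2)], 20)        # responses die out
unit_root = impulse_responses([np.eye(2)], 20)               # responses persist
explosive = impulse_responses([np.array([[1.004]])], 100)    # responses keep growing

print(stationary[3][0, 0], unit_root[20][0, 0], round(explosive[100][0, 0], 3))
```

Even in this scalar case, replacing a root of exactly one by 1.004 makes the response at horizon 100 almost half as large again, which is the mechanism behind the divergence of the OLS-based responses at long horizons.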


9.8 Testing for a single long-run relationship 


Consider again the levels VAR of (9.24), now written as

\[
z_t = \mu_0 + \mu_1 t + \sum_{i=1}^{m} \Pi_i z_{t-i} + v_t \tag{9.36}
\]

where the elements of $z_t$ are permitted to be either I(0), I(1) or cointegrated,
along with the unrestricted VECM

\[
\Delta z_t = \mu_0 + \mu_1 t + Az_{t-1} + \sum_{i=1}^{m-1} \Phi_i\,\Delta z_{t-i} + v_t \tag{9.37}
\]

where

\[
A = \sum_{i=1}^{m} \Pi_i - I_n
\]

and

\[
\Phi_i = -\sum_{j=i+1}^{m} \Pi_j, \qquad i = 1, \ldots, m-1
\]

are now referred to as the matrices of the long-run multipliers and the
short-run dynamic coefficients, respectively.

Figure 9.14 Estimated impulse response functions: (a) VECM impulse responses; (b) OLS impulse responses
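Consider now the partition $z_t = (y_t, x_t^{\top})^{\top}$, where $y_t$ is scalar, and define
the conformable partitions $v_t = (v_{1t}, v_{2t}^{\top})^{\top}$ and

\[
\mu_j = \begin{pmatrix} \mu_{j1} \\ \mu_{j2} \end{pmatrix}, \; j = 0, 1, \qquad
A = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix}, \qquad
\Phi_i = \begin{bmatrix} \phi_{11,i} & \phi_{12,i} \\ \phi_{21,i} & \Phi_{22,i} \end{bmatrix}, \qquad
\Sigma_v = \begin{bmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{21} & \Sigma_{22} \end{bmatrix}
\]

This is similar to the partitioning used in section 9.5.5 to investigate weak
exogeneity in VECMs, although here we do not assume that $A$ is necessarily of
reduced rank. We do, though, assume that $A_{21} = 0$, which ensures that there
exists at most one (non-degenerate) long-run relationship between $y_t$ and $x_t$,
irrespective of the order of integration of the $x_t$ process. Equation (9.37) can
then be written in terms of the dependent variable $y_t$ and the forcing variables $x_t$ as

\begin{align}
\Delta y_t &= \mu_{01} + \mu_{11}t + A_{11}y_{t-1} + A_{12}x_{t-1}
 + \sum_{i=1}^{m-1} \phi_{11,i}\,\Delta y_{t-i} + \sum_{i=1}^{m-1} \phi_{12,i}\,\Delta x_{t-i} + v_{1t} \tag{9.38} \\
\Delta x_t &= \mu_{02} + \mu_{12}t + A_{22}x_{t-1}
 + \sum_{i=1}^{m-1} \phi_{21,i}\,\Delta y_{t-i} + \sum_{i=1}^{m-1} \Phi_{22,i}\,\Delta x_{t-i} + v_{2t} \tag{9.39}
\end{align}

The contemporaneous correlation between $v_{1t}$ and $v_{2t}$ can be characterised
by the regression

\[
v_{1t} = \omega^{\top}v_{2t} + \xi_t \tag{9.40}
\]

where $\omega = \Sigma_{22}^{-1}\sigma_{21}$, $\{\xi_t\}$ is a WN$(0, \sigma_{\xi}^2)$ process with $\sigma_{\xi}^2 = \sigma_{11} - \sigma_{12}\Sigma_{22}^{-1}\sigma_{21}$,
and the $\{v_{2t}\}$ and $\{\xi_t\}$ processes are uncorrelated by construction. Substituting
(9.39) and (9.40) into (9.38) yields

\[
\Delta y_t = a_0 + a_1 t + \phi y_{t-1} + \delta^{\top}x_{t-1}
 + \sum_{i=1}^{m-1} \psi_i\,\Delta y_{t-i} + \sum_{i=0}^{m-1} \varphi_i^{\top}\Delta x_{t-i} + \xi_t \tag{9.41}
\]

where

\[
a_0 = \mu_{01} - \omega^{\top}\mu_{02}, \qquad a_1 = \mu_{11} - \omega^{\top}\mu_{12}, \qquad \phi = A_{11}, \qquad \delta^{\top} = A_{12} - \omega^{\top}A_{22}
\]

\[
\psi_i = \phi_{11,i} - \omega^{\top}\phi_{21,i}, \qquad \varphi_0 = \omega, \qquad \varphi_i^{\top} = \phi_{12,i} - \omega^{\top}\Phi_{22,i}
\]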
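
The substitution that produces the unrestricted ECM (9.41) can be written out in two steps. The following is a sketch of the algebra under the assumption that $A_{12}$ and $\phi_{12,i}$ are treated as $1 \times k$ row vectors and $\omega$ as a $k \times 1$ column vector; these dimension conventions are assumptions made here for the illustration.

```latex
\begin{align*}
% Step 1: solve (9.39) for the marginal innovation
v_{2t} &= \Delta x_t - \mu_{02} - \mu_{12}t - A_{22}x_{t-1}
          - \sum_{i=1}^{m-1}\phi_{21,i}\,\Delta y_{t-i}
          - \sum_{i=1}^{m-1}\Phi_{22,i}\,\Delta x_{t-i} \\
% Step 2: replace v_{1t} in (9.38) by \omega^{\top}v_{2t} + \xi_t, using (9.40)
\Delta y_t &= (\mu_{01} - \omega^{\top}\mu_{02}) + (\mu_{11} - \omega^{\top}\mu_{12})\,t
          + A_{11}y_{t-1} + (A_{12} - \omega^{\top}A_{22})\,x_{t-1} \\
       &\quad + \sum_{i=1}^{m-1}(\phi_{11,i} - \omega^{\top}\phi_{21,i})\,\Delta y_{t-i}
          + \omega^{\top}\Delta x_t
          + \sum_{i=1}^{m-1}(\phi_{12,i} - \omega^{\top}\Phi_{22,i})\,\Delta x_{t-i} + \xi_t
\end{align*}
```

Collecting terms gives the intercept, trend, level and short-run coefficients of (9.41), with the contemporaneous term $\omega^{\top}\Delta x_t$ supplying the $i = 0$ element of the second summation.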
It follows from (9.41) that, if $\phi \neq 0$ and $\delta \neq 0$, then there exists a long-run
relationship between the levels of $y_t$ and $x_t$, given by

\[
y_t = \theta_0 + \theta_1 t + \theta^{\top}x_t + \upsilon_t \tag{9.42}
\]

where $\theta_0 = -a_0/\phi$, $\theta_1 = -a_1/\phi$, $\theta = -\delta/\phi$ is the vector of long-run response
parameters and $\{\upsilon_t\}$ is a mean zero stationary process. If $\phi < 0$ then this
long-run relationship is stable and (9.42) can be written in the ECM form

\[
\Delta y_t = a_0 + a_1 t + \phi\,(y_{t-1} - \theta^{\top}x_{t-1})
 + \sum_{i=1}^{m-1} \psi_i\,\Delta y_{t-i} + \sum_{i=0}^{m-1} \varphi_i^{\top}\Delta x_{t-i} + \xi_t \tag{9.43}
\]

If $\phi = 0$ in (9.43) then no long-run relationship exists between $y_t$ and $x_t$. A test
for $\phi = 0$ runs into the difficulty that the long-run parameter vector $\theta$ is no
longer identified under this null, being present only under the alternative
hypothesis. Consequently, Pesaran, Shin and Smith (2001) test for the absence
of a long-run relationship, and avoid the lack of identifiability of $\theta$, by
examining the joint null hypothesis $\phi = 0$ and $\delta = 0$ in the unrestricted ECM
(9.41). Note that it is then possible for the long-run relationship to be
degenerate, in that $\phi \neq 0$ but $\delta = 0$, in which case the long-run relationship
involves only $y_t$ and possibly a linear trend.

Pesaran, Shin and Smith (2001) consider the conventional Wald statistic
of the null $\phi = 0$ and $\delta = 0$ and show that its asymptotic distribution involves
the non-standard unit root distribution and depends on both the dimension
and cointegration rank ($0 \le r \le k$) of the forcing variables $x_t$. This
cointegration rank is the rank of the matrix $A_{22}$ appearing in (9.39). Pesaran, Shin
and Smith obtain this asymptotic distribution in two polar cases: (i), when
$A_{22}$ is of full rank, in which case $x_t$ is an I(0) vector process; and (ii), when the
$x_t$ process is not mutually cointegrated ($r = 0$ and $A_{22} = 0$) and hence is an
I(1) process. They point out that the critical values obtained from
stochastically simulating these two distributions must provide lower and upper
critical value bounds for all possible classifications of the forcing variables
into I(0), I(1) and cointegrated processes.

A bounds procedure to test for the existence of a long-run relationship 
within the unrestricted ECM (9.41) is therefore as follows. If the Wald (or 
related F-) statistic falls below the lower critical value bound, then the null 
$\phi = 0$ and $\delta = 0$ is not rejected, irrespective of the order of integration or
cointegration rank of the variables. Similarly, if the statistics are greater than
their upper critical value bounds, the null is rejected and we conclude that
there is a long-run relationship between $y_t$ and $x_t$. If the statistics fall within
the bounds, inference is inconclusive and detailed information about the 
integration/cointegration properties of the variables is then necessary in 
order to proceed further. It is the fact that we may be able to make firm 
inferences without this information, and thus avoid the severe pre-testing 
problems usually involved in this type of analysis, that makes this procedure 
attractive in applied situations. Pesaran, Shin and Smith provide critical
values for alternative values of $k$ under two situations: case 1, when $a_0 \neq 0$,
$a_1 = 0$ (with intercept but no trend in (9.41)), and case 2, when $a_0 \neq 0$,
$a_1 \neq 0$ (with both intercept and trend in (9.41)).

Pesaran, Shin and Smith show that this testing procedure is consistent and 
that the approach is applicable in quite general situations. For example, 
equation (9.41) can be regarded as an autoregressive distributed lag 
(ARDL) model in y, and x, having all lag orders equal to m. Differential lag 
lengths can be used without affecting the asymptotic distribution of the test 
statistic. 
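
The decision rule of the bounds procedure can be summarised in a few lines. The sketch below is illustrative only: the function name and return strings are hypothetical, and the critical value bounds must be supplied by the user from the tables of Pesaran, Shin and Smith (2001). The figures used in the call are the 5 per cent case 1 bounds for $k = 1$ and the Wald statistic quoted in example 9.8.

```python
def bounds_test_decision(stat: float, lower: float, upper: float) -> str:
    """Classify a Wald (or F) statistic against lower and upper critical value bounds."""
    if lower > upper:
        raise ValueError("lower bound must not exceed upper bound")
    if stat < lower:
        return "do not reject: no long-run relationship"
    if stat > upper:
        return "reject: long-run relationship exists"
    return "inconclusive: integration/cointegration properties needed"


decision = bounds_test_decision(11.86, lower=9.87, upper=11.53)
print(decision)
```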


Example 9.8 Is there a long-run market model? 
In examples 9.1 and 9.2 we investigated whether there was cointegration 
between the LGEN stock price and the FTSE 100 index on the assumption 
that the logarithms of both series were I(1). As was remarked in example 9.1, 
the latter series is only marginally I(1): its ADF test statistic is —3.38, which is 
close to the 5 per cent critical value of — 3.42. We thus investigate the 
existence of a long-run relationship between the two series (denoted once
again as $y_t$ and $x_t$) using the testing technique outlined above, which does not
require a definite classification of the integration properties of $x_t$. Estimating
equation (9.41) with $m = 3$ produced a Wald test statistic of 11.86 for both
case 1 and case 2, as the trend was found to be completely insignificant. With
$k = 1$ as here, the 5 per cent significance level bounds for the Wald statistic
for case 1 are 9.87 and 11.53, so that the hypothesis of no long-run
relationship is clearly rejected, irrespective of the order of integration of $x_t$.
Given this evidence in favour of a long-run relationship, we then fitted a
parsimonious form of the ECM (9.43), obtaining

\[
\Delta y_t = -\underset{(0.015)}{0.053}\,\bigl(y_{t-1} - \underset{(0.005)}{1.086}\,x_{t-1}\bigr)
 - \underset{(0.028)}{0.118}\,(\Delta y_{t-1} + \Delta y_{t-2})
 + \underset{(0.057)}{1.106}\,\Delta x_t + \underset{(0.063)}{0.151}\,\Delta x_{t-2}
\]

Figure 9.15 Impulse responses from the two market models


How does this model, which has an equation standard error of 3.02 per cent, 
compare with the model fitted in example 9.2? That model, written in ECM 
form, is 


\[
\Delta y_t = -0.006 - 0.057\,(y_{t-1} - x_{t-1}) - 0.128\,(\Delta y_{t-1} + \Delta y_{t-2})
 + 1.107\,\Delta x_t + 0.128\,(\Delta x_{t-1} + \Delta x_{t-2})
\]

There is thus a difference in the long-run response (1.086 compared to 1)
and some differences in the short-run dynamics (as well as a constant being
significant in the latter model). Written in levels, the two models are

\[
y_t = 1.106x_t - 1.048x_{t-1} + 0.151x_{t-2} - 0.151x_{t-3} + 0.829y_{t-1} + 0.118y_{t-3}
\]

and

\[
y_t = 1.107x_t - 0.922x_{t-1} - 0.128x_{t-3} + 0.815y_{t-1} + 0.128y_{t-3} - 0.006
\]


Figure 9.15 shows the impulse response functions calculated from the two 
models. It is seen that the impact effect is almost identical in the two models 
and that, after some initial fluctuations, both functions converge mono- 
tonically to long-run equilibrium, even though the shapes are very different. 
Nevertheless, the median lags are almost the same, being of the order of 
twelve months, so that convergence to equilibrium is rather slow. 
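
The passage from the ECM form to the levels form, and the long-run responses of 1.086 and 1, can be verified numerically. The sketch below is an illustration, not the calculation behind figure 9.15: it converts each ECM into lag polynomials $a(L)y_t = b(L)x_t$ and traces the response of $y$ to a permanent unit increase in $x$. The function names are hypothetical and the constant in the second model is ignored, since it does not affect the response.

```python
import numpy as np


def ecm_to_ardl(phi, theta, psi, gamma):
    """Lag polynomials a(L), b(L) of the levels form implied by the ECM

    dy_t = phi (y_{t-1} - theta x_{t-1}) + sum_i psi[i-1] dy_{t-i} + sum_i gamma[i] dx_{t-i},
    with a[0] = 1 and the convention a(L) y_t = b(L) x_t.
    """
    one_minus_L = np.array([1.0, -1.0])
    psi = np.asarray(psi, dtype=float)
    gamma = np.asarray(gamma, dtype=float)
    a = np.zeros(len(psi) + 2)
    a[:2] = one_minus_L
    a[1] -= phi
    a -= np.convolve(np.r_[0.0, psi], one_minus_L)
    b = np.convolve(gamma, one_minus_L)
    b[1] -= phi * theta
    return a, b


def step_response(a, b, horizon):
    """Response of y at horizons 0..horizon to a permanent unit increase in x at time 0."""
    resp = np.zeros(horizon + 1)
    for h in range(horizon + 1):
        total = b[: h + 1].sum()
        for k in range(1, min(h, len(a) - 1) + 1):
            total -= a[k] * resp[h - k]
        resp[h] = total
    return resp


# Model fitted in this example and the model of example 9.2
a1, b1 = ecm_to_ardl(-0.053, 1.086, [-0.118, -0.118], [1.106, 0.0, 0.151])
a2, b2 = ecm_to_ardl(-0.057, 1.000, [-0.128, -0.128], [1.107, 0.128, 0.128])
resp1 = step_response(a1, b1, 600)
resp2 = step_response(a2, b2, 600)
print(np.round(-a1[1:], 3), np.round(b1, 3))
print(round(resp1[0], 3), round(resp1[-1], 3), round(resp2[0], 3), round(resp2[-1], 3))
```

The coefficients recovered in `a1`, `b1`, `a2` and `b2` agree, after rounding, with the two levels equations above, and the step responses start at the impact effects and settle at the respective long-run responses.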


9.9 Common trends and cycles


Consider again the VAR(m) process for the I(1) vector $z_t$,

\[
\Pi(B)z_t = v_t
\]

where, for simplicity, we exclude the constant and linear trend and set initial
values $z_0 = v_0 = 0$. Analogous to the 'levels solution' (9.26), we then have

\[
z_t = C(1)s_t + C^*(B)v_t
\]

If there is cointegration then $C(1)$ is of reduced rank $k = n - r$ and can be
written as the product $\gamma\delta^{\top}$, both of which have rank $k$. Thus, on defining

\[
\tau_t = \delta^{\top}s_t, \qquad c_t = C^*(B)v_t
\]

we have the Stock and Watson (1988) 'common trends' representation

\[
z_t = \gamma\tau_t + c_t, \qquad \tau_t = \tau_{t-1} + \delta^{\top}v_t \tag{9.44}
\]

which expresses $z_t$ as a linear combination of $k = n - r$ random walks, being
the common trends $\tau_t$, plus some stationary 'transitory' components $c_t$.
Equation (9.44) may be regarded as a multivariate extension of the Beveridge
and Nelson (1981) decomposition introduced in chapter 4, section 1.1. As
Wickens (1996) shows, $\delta$ is not uniquely defined (cf. the argument
concerning the cointegrating matrix $\alpha$), so these trends are also not uniquely
defined without introducing additional identifying conditions.

In the same way that common trends appear in $z_t$ when $C(1)$ is of reduced
rank, common cycles appear if $C^*(B)$ is of reduced rank, since $c_t = C^*(B)v_t$ is
the cyclical component of $z_t$. The presence of common cycles requires that
there are linear combinations of the elements of $z_t$ that do not contain these
cyclical components, i.e. that there is a set of $s$ linearly independent vectors,
gathered together in the $n \times s$ matrix $\phi$, such that

\[
\phi^{\top}c_t = \phi^{\top}C^*(B)v_t = 0
\]

in which case

\[
\phi^{\top}z_t = \phi^{\top}\gamma\tau_t
\]

Such a matrix will exist if all the $C^*_i$ have less than full rank and if $\phi^{\top}C^*_i = 0$
for all $i$ (see Vahid and Engle, 1993, and Engle and Issler, 1995).

Under these circumstances, we can write $C^*_i = G\tilde{C}_i$ for all $i$, where $G$ is an
$n \times (n-s)$ matrix having full column rank and $\tilde{C}_i$ may not have full rank.
The cyclical component can then be written as

\[
c_t = G\tilde{C}(B)v_t = G\tilde{c}_t
\]

so that the $n$-element cycle $c_t$ can be written as linear combinations of an
$(n-s)$-element cycle $\tilde{c}_t$, thus leading to the common trend/common cycle
representation

\[
z_t = \gamma\tau_t + G\tilde{c}_t \tag{9.45}
\]

The number, $s$, of linearly independent 'cofeature' vectors making up $\phi$ can
be at most $k = n - r$, and these will be linearly independent of the
cointegrating vectors making up $\alpha$ (Vahid and Engle, 1993, theorem 1). This is a
consequence of the fact that $\phi^{\top}z_t$, being the vector of common trends, is I(1),
whereas $\alpha^{\top}z_t$, being the vector of error corrections, is I(0).

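An interesting special case of the representation (9.45) occurs when
$r + s = n$. In these circumstances, $z_t$ has the unique trend/cycle
decomposition $z_t = z_t^{\tau} + z_t^{c}$, where

\[
z_t^{\tau} = \Theta_2\phi^{\top}z_t = \Theta_2\phi^{\top}\gamma\tau_t
\]

contains the stochastic trends and

\[
z_t^{c} = \Theta_1\alpha^{\top}z_t = \Theta_1\alpha^{\top}c_t
\]

contains the cyclical component. Here

\[
[\,\Theta_1 \;\; \Theta_2\,] = \begin{bmatrix} \alpha^{\top} \\ \phi^{\top} \end{bmatrix}^{-1}
\]

Note that $z_t^{c}$ is a linear combination of the error correction terms $e_t = \alpha^{\top}z_t$.
Since both $z_t^{\tau}$ and $z_t^{c}$ are functions of $\alpha$ and $\phi$, they can easily be calculated as
simple linear combinations of $z_t$.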
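
The special case $r + s = n$ lends itself to a direct calculation. The sketch below splits a data matrix into trend and cycle components by inverting the stacked matrix of cointegrating and cofeature vectors; the function name, the bivariate vectors and the simulated data are hypothetical, chosen only to show that the two components add back to the data and are annihilated by $\alpha$ and $\phi$, respectively.

```python
import numpy as np


def trend_cycle_split(z, alpha, phi):
    """Trend and cycle components of z (T x n) when r + s = n.

    alpha is n x r (cointegrating vectors), phi is n x s (cofeature vectors).
    """
    n = z.shape[1]
    stacked = np.vstack([alpha.T, phi.T])
    if stacked.shape != (n, n):
        raise ValueError("decomposition requires r + s = n")
    theta = np.linalg.inv(stacked)          # [Theta_1, Theta_2]
    r = alpha.shape[1]
    theta1, theta2 = theta[:, :r], theta[:, r:]
    cycle = z @ alpha @ theta1.T            # rows are (Theta_1 alpha' z_t)'
    trend = z @ phi @ theta2.T              # rows are (Theta_2 phi' z_t)'
    return trend, cycle


# Hypothetical bivariate example with r = s = 1
rng = np.random.default_rng(1)
z = np.cumsum(rng.standard_normal((50, 2)), axis=0)
alpha = np.array([[1.0], [-1.0]])
phi = np.array([[1.0], [0.5]])

trend, cycle = trend_cycle_split(z, alpha, phi)
print(np.abs(trend + cycle - z).max())
```

The construction guarantees $\alpha^{\top}z_t^{\tau} = 0$ and $\phi^{\top}z_t^{c} = 0$, so the trend component carries none of the error corrections and the cycle component carries none of the common trends.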

The common trend/common cycle representation (9.45) depends, of
course, on the number of cointegrating and cofeature vectors, r and s, in the
system. The number of cofeature vectors (i.e. common cycles) can be
determined using the approach of Engle and Kozicki (1993), as extended by
Vahid and Engle (1993) to the current context in which there may also be
cointegration. The rank s of the cofeature matrix can be determined by
calculating the test statistic

$$C(s) = -(T - m - 2)\sum_{i=1}^{s}\log(1 - \ell_i^2)$$

where $\ell_1^2, \ldots, \ell_s^2$ are the s smallest squared canonical correlations between
$\Delta z_t$ and the set $(\Delta z_{t-1}, \ldots, \Delta z_{t-m+1}, e_{t-1})$. Under the null hypothesis
that the rank of $\phi$ is at least s, this statistic has a $\chi^2$ distribution with
$s^2 + sn(m-1) + sr - sn$ degrees of freedom (Vahid and Engle, 1993). The
canonical correlations may be computed using the procedure outlined in
Hamilton (1994, chap. 20.1).
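As a rough sketch of how the statistic might be computed, the code below obtains squared canonical correlations from a QR/SVD factorisation and then forms C(s). This is one standard numerical route, not necessarily the procedure in Hamilton (1994); the function names are hypothetical and the series are demeaned as an assumption.

```python
import numpy as np

def squared_canonical_correlations(y, x):
    """Squared canonical correlations between the columns of y and x, ascending."""
    qy, _ = np.linalg.qr(y - y.mean(axis=0))
    qx, _ = np.linalg.qr(x - x.mean(axis=0))
    sv = np.linalg.svd(qy.T @ qx, compute_uv=False)
    return np.sort(np.clip(sv, 0.0, 1.0) ** 2)

def common_feature_stat(sq_corrs, T, m, s):
    """C(s) = -(T - m - 2) * sum of log(1 - l_i^2) over the s smallest l_i^2."""
    smallest = np.sort(np.asarray(sq_corrs, dtype=float))[:s]
    return -(T - m - 2) * np.sum(np.log(1.0 - smallest))
```

Here `y` would hold $\Delta z_t$ and `x` the stacked regressors $(\Delta z_{t-1}, \ldots, \Delta z_{t-m+1}, e_{t-1})$, one row per observation.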

An equivalent representation is obtained by incorporating the s cofeature
vectors, as well as the r cointegrating vectors, into the VECM representation

$$\Delta z_t = \Phi(B)\Delta z_{t-1} + \beta e_{t-1} + v_t \qquad (9.46)$$

directly. Vahid and Engle (1993) point out that the cofeature matrix $\phi$ is
identified only up to an invertible transformation, as any linear combination
of the columns of $\phi$ will also be a cofeature vector. The matrix can therefore
be rotated to have an s-dimensional identity submatrix

$$\phi = \begin{bmatrix} I_s \\ \phi^*_{(n-s)\times s} \end{bmatrix}$$

$\phi'\Delta z_t$ can then be considered as s 'pseudo-structural-form' equations for
the first s elements of $\Delta z_t$. The system can be completed by adding the
unconstrained VECM equations for the remaining n - s equations of $\Delta z_t$ to
obtain the system


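$$\begin{bmatrix} I_s & \phi^{*\prime} \\ 0_{(n-s)\times s} & I_{n-s} \end{bmatrix}\Delta z_t = \begin{bmatrix} 0_{s\times(n(m-1)+r)} \\ \Phi_1^* \;\cdots\; \Phi_{m-1}^* \;\; \beta^* \end{bmatrix}\begin{bmatrix} \Delta z_{t-1} \\ \vdots \\ \Delta z_{t-m+1} \\ e_{t-1} \end{bmatrix} + v_t \qquad (9.47)$$

where $\Phi_1^*$ contains the last n - s rows of $\Phi_1$, etc. Writing the restricted model
in this way makes it clear why there are $s^2 + sn(m-1) + sr - sn$ degrees of
freedom for the common feature test statistic C(s). The unrestricted VECM
(9.46) has n(n(m - 1) + r) parameters, whereas the pseudo-structural model
(9.47) has $sn - s^2$ parameters in the first s equations and (n - s)(n(m - 1)
+ r) parameters in the n - s equations that complete the system, so imposing
$s^2 + sn(m-1) + sr - sn$ restrictions.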
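The parameter count can be checked mechanically. The short sketch below (function names are illustrative) confirms that the difference between the unrestricted and restricted parameter counts equals the stated degrees of freedom, and evaluates it for n = 4, r = 2 and m = 2, the values used in example 9.9.

```python
def unrestricted_params(n, r, m):
    """Parameters in the unrestricted VECM (9.46)."""
    return n * (n * (m - 1) + r)

def restricted_params(n, r, m, s):
    """Parameters in the pseudo-structural model (9.47)."""
    return (s * n - s * s) + (n - s) * (n * (m - 1) + r)

def common_feature_df(n, r, m, s):
    """Degrees of freedom of C(s): s^2 + sn(m-1) + sr - sn."""
    return s * s + s * n * (m - 1) + s * r - s * n

df_s1 = common_feature_df(4, 2, 2, 1)   # 3
df_s2 = common_feature_df(4, 2, 2, 2)   # 8
```

With m = 1 and r = n - s the count is zero, which is the just-identified case noted in the text.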
The system (9.47) can be estimated by full-information maximum
likelihood (FIML) or some other simultaneous equation estimation technique,
and a likelihood ratio statistic of the restrictions imposed by the s cofeature
vectors can then be constructed, which will be equivalent to C(s).
Equivalently, the common cycle restrictions can be imposed directly on the VECM
to yield

$$\Delta z_t = \begin{bmatrix} -\phi^{*\prime} \\ I_{n-s} \end{bmatrix}\left[\Phi_1^*\Delta z_{t-1} + \cdots + \Phi_{m-1}^*\Delta z_{t-m+1} + \beta^* e_{t-1}\right] + v_t \qquad (9.48)$$

which is a reduced-rank VECM. Note that, if m = 1 and r = n - s, the system
will be just-identified and no test for common cycles is needed, for the
system will necessarily have r common cycles. As the lag order m increases, so
the system will generally become overidentified and tests for common cycles
become necessary.

From (9.48), it is clear that the presence of s common cycles implies that
$\phi'\Delta z_t$ is independent of $\Delta z_{t-1}, \ldots, \Delta z_{t-m+1}$ and $e_{t-1}$, and hence of all
past values of $v_t$. Vahid and Engle (1997) have subsequently generalised this
approach to consider 'codependent' cycles. A codependent cycle of order q is
the linear combination of $\Delta z_t$ that is independent of $v_{t-j}$, j > q, so that a
common cycle is a codependent cycle of order 0. For a recent survey of
common trends and cycles modelling, see Vahid (2006).


Example 9.9 Are there common cycles in the UK financial markets? 
In example 9.3 we found that, in the VECM fitted to $z_t = (p_t, d_t, rs_t, r20_t)$,
there were r = 2 cointegrating vectors and hence k = 2 common trends.
There can, then, be at most two common cycles. If s was two, then the
structural model (9.47) with m = 2 would take the form of two structural
equations:

$$\Delta p_t = -\phi^*_{13}\Delta rs_t - \phi^*_{14}\Delta r20_t + v_{1t}$$
$$\Delta d_t = -\phi^*_{23}\Delta rs_t - \phi^*_{24}\Delta r20_t + v_{2t}$$

say, and two unrestricted reduced-form equations for the other two
variables, $\Delta rs_t$ and $\Delta r20_t$, which imposes a total of eight restrictions.
The restricted reduced-form (9.48) replaces the above two structural
equations with reduced-form equations in which the coefficients are linear
combinations of $\Phi_1^*$ and $\beta^*$, the weights being given by (minus) the $\phi^*$
coefficients.


Table 9.5 Common cycle tests

Null      C(p, s)    df    p-value
s > 0       6.15      3     .105
s > 1      35.47      8     .000


Table 9.5 provides the common-feature test statistics, from which we
see that s = 1. The estimated common cycle in the pseudo-structural model
(9.48) with s = 1 takes the form, after the deletion of insignificant coefficients,

$$\phi_1'\Delta z_t = \Delta p_t + 1.440\,\Delta r20_t$$

so that, once again, we find that the equity and gilt markets are
contemporaneously negatively correlated and their logarithmic changes share a common
cycle.


10 


Further topics in the analysis of 
integrated financial time series 


In this chapter we investigate several further topics in the analysis of inte- 
grated time series. Section 1 looks at the links between present value models, 
excess volatility, rational bubbles and cointegration, while section 2 con- 
siders non-linear extensions of cointegration and error correction models, 
and briefly discusses some recent techniques for introducing structural 
breaks and infinite variance errors into the cointegrating framework. 


10.1 Present value models, excess volatility and cointegration 


10.1.1 Present value models and the ‘simple’ efficient markets hypothesis 




As remarked in chapter 1, present value models are extremely popular in 
finance as they are often used to formulate models of efficient markets. 
Written generally, a present value model for two variables, $y_t$ and $x_t$, states
that $y_t$ is a linear function of the present discounted value of the expected
future values of $x_t$:

$$y_t = \theta(1-\delta)\sum_{i=0}^{\infty}\delta^i E(x_{t+i}|\Phi_t) + c \qquad (10.1)$$

where c, the constant, $\theta$, the coefficient of proportionality, and $\delta$, the
constant discount factor, are parameters that may be known a priori or may need
to be estimated. As usual, $E(x_{t+i}|\Phi_t)$ is the expectation of $x_{t+i}$ conditional on
the information set available at time t, $\Phi_t$.

A simple example of how (10.1) might arise is to consider an implication
of the efficient markets hypothesis, that stock returns, $r_t$, are unforecastable.
This can be formalised as $E(r_{t+1}|\Phi_t) = r$, where r is a constant, sometimes
referred to as the discount rate (see Shiller, 1981a, 1981b). If $y_t$ is the
beginning of period t stock price and $x_t$ the dividend paid during the period,
then

$$r_{t+1} = (y_{t+1} - y_t + x_t)/y_t$$

so that we can express $y_t$ as the first-order rational expectations model of the form

$$y_t = \delta E(y_{t+1}|\Phi_t) + \delta x_t \qquad (10.2)$$

where $\delta = 1/(1 + r)$. This can be solved by recursive substitution to yield

$$y_t = \sum_{i=0}^{n-1}\delta^{i+1}E(x_{t+i}|\Phi_t) + \delta^n E(y_{t+n}|\Phi_t) \qquad (10.3)$$

If we impose the terminal (or transversality) condition that the second term in
(10.3) goes to zero as $n \to \infty$, the present value relation (10.1) is obtained with
c = 0 and $\theta = \delta/(1-\delta) = 1/r$.

Typically, $y_t$ and $x_t$ will be I(1) processes, so Campbell and Shiller (1987,
1988a) consider subtracting $(\delta/(1-\delta))x_t$ from both sides of (10.3). On
defining $\theta = \delta/(1-\delta) = 1/r$ and rearranging, we obtain a new variable, $S_t$,
which Campbell and Shiller (1987) term the 'spread':

$$S_t = y_t - \theta x_t = \theta\sum_{i=1}^{\infty}\delta^i E(\Delta x_{t+i}|\Phi_t) \qquad (10.4)$$


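If $y_t$ and $x_t$ are I(1), then it follows from (10.4) that $S_t$ must be I(0), which in turn
implies that $y_t$ and $x_t$ are cointegrated with cointegrating parameter $\theta$.
Consequently, $S_t$ and $\Delta x_t$ must together form a jointly covariance stationary process,
which can be approximated in finite samples by a bivariate VAR(m) process:

$$S_t = \sum_{i=1}^{m} a_i S_{t-i} + \sum_{i=1}^{m} b_i\Delta x_{t-i} + v_{1t}$$
$$\Delta x_t = \sum_{i=1}^{m} c_i S_{t-i} + \sum_{i=1}^{m} d_i\Delta x_{t-i} + v_{2t} \qquad (10.5)$$

It is convenient to rewrite (10.5) in companion form, i.e. as

$$\begin{bmatrix} S_t \\ S_{t-1} \\ \vdots \\ S_{t-m+1} \\ \Delta x_t \\ \Delta x_{t-1} \\ \vdots \\ \Delta x_{t-m+1} \end{bmatrix} = \begin{bmatrix} a_1 & \cdots & a_{m-1} & a_m & b_1 & \cdots & b_{m-1} & b_m \\ 1 & & & & & & & \\ & \ddots & & & & & & \\ & & 1 & 0 & & & & \\ c_1 & \cdots & c_{m-1} & c_m & d_1 & \cdots & d_{m-1} & d_m \\ & & & & 1 & & & \\ & & & & & \ddots & & \\ & & & & & & 1 & 0 \end{bmatrix}\begin{bmatrix} S_{t-1} \\ S_{t-2} \\ \vdots \\ S_{t-m} \\ \Delta x_{t-1} \\ \Delta x_{t-2} \\ \vdots \\ \Delta x_{t-m} \end{bmatrix} + \begin{bmatrix} v_{1t} \\ 0 \\ \vdots \\ 0 \\ v_{2t} \\ 0 \\ \vdots \\ 0 \end{bmatrix}$$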
where blank elements are zero. This can be written more compactly, in an
obvious notation, as

$$z_t = \Pi z_{t-1} + v_t \qquad (10.6)$$

We can use the first-order formulation (10.6) to express the variant of the
present value model presented as equation (10.4) in closed-form solution,
i.e. as a function of variables known to agents at the time expectations are
formed. If we restrict the information set to consist only of current and
lagged $S_t$ and $\Delta x_t$, i.e. $\Phi_t^* = (S_t^0, \Delta x_t^0)$, using the notation introduced in
chapter 8, then the conditional expectation of future values of $z_t$,
conditional on $\Phi_t^*$, is

$$E(z_{t+i}|\Phi_t^*) = \Pi^i z_t$$

Define g as a $(2m \times 1)$ selection vector with unity as the first element and
zeros elsewhere, and h as another selection vector with unity as the (m + 1)-th
element and zeros elsewhere, so that $S_t = g'z_t$, $\Delta x_t = h'z_t$, and

$$E(\Delta x_{t+i}|\Phi_t^*) = E(h'z_{t+i}|\Phi_t^*) = h'\Pi^i z_t$$

Equation (10.4) can then be written as

$$g'z_t = \theta h'\sum_{i=1}^{\infty}\delta^i\Pi^i z_t = \theta h'\delta\Pi(I - \delta\Pi)^{-1}z_t \qquad (10.7)$$


which is a closed-form variant of the present value model. The advantage of
this formulation is that it imposes the model's restrictions on the coefficients
of the VAR, since, if (10.7) is to hold non-trivially, the following 2m
restrictions must hold:

$$g' - \theta h'\delta\Pi(I - \delta\Pi)^{-1} = 0 \qquad (10.8)$$

Although these restrictions appear complex and hard to interpret, for a
given $\delta$, and hence $\theta$, they turn out to be equivalent to this set of linear
restrictions:

$$1 - \delta a_1 - \theta\delta c_1 = 0$$
$$a_i + \theta c_i = 0 \qquad i = 2, \ldots, m \qquad (10.9)$$
$$b_i + \theta d_i = 0 \qquad i = 1, \ldots, m$$
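The equivalence between the non-linear restrictions (10.8) and the linear set (10.9) can be checked numerically. The sketch below builds a companion matrix whose coefficients are chosen to satisfy (10.9) and evaluates the left-hand side of (10.8). The coefficient values and function names are purely illustrative assumptions, and $\theta$ is taken to be $\delta/(1-\delta)$.

```python
import numpy as np

def companion(a, b, c, d):
    """Companion matrix Pi of the VAR(m) (10.5), ordered as
    z_t = (S_t, ..., S_{t-m+1}, dx_t, ..., dx_{t-m+1})'."""
    m = len(a)
    pi = np.zeros((2 * m, 2 * m))
    pi[0, :m], pi[0, m:] = a, b        # S_t equation
    pi[m, :m], pi[m, m:] = c, d        # dx_t equation
    for i in range(1, m):              # identities that shift the lags
        pi[i, i - 1] = 1.0
        pi[m + i, m + i - 1] = 1.0
    return pi

def pv_restriction_vector(pi, delta, theta):
    """Left-hand side of (10.8): g' - theta h' delta Pi (I - delta Pi)^{-1}."""
    k = pi.shape[0]
    g = np.zeros(k); g[0] = 1.0
    h = np.zeros(k); h[k // 2] = 1.0
    # Solve (I - delta Pi)' x = (delta Pi)' h, so x' = h' delta Pi (I - delta Pi)^{-1}
    x = np.linalg.solve((np.eye(k) - delta * pi).T, (delta * pi).T @ h)
    return g - theta * x

delta = 0.95
theta = delta / (1.0 - delta)
c = np.array([0.01, 0.005])            # free coefficients (assumed values)
d = np.array([0.3, 0.1])
a = np.array([(1.0 - theta * delta * c[0]) / delta, -theta * c[1]])  # imposes (10.9)
b = -theta * d                                                        # imposes (10.9)
restricted = pv_restriction_vector(companion(a, b, c, d), delta, theta)
```

With coefficients that do not satisfy (10.9), the same function returns a non-zero vector, which is what a Wald test of the restrictions examines.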
These restrictions can be interpreted as follows. The present value model
implies that, from (10.2),

$$E(y_t - \delta^{-1}y_{t-1} + x_{t-1}|\Phi_{t-1}^*) = 0 \qquad (10.10)$$

or, equivalently,

$$E(y_t|\Phi_{t-1}^*) = \delta^{-1}y_{t-1} - x_{t-1}$$

so that $\delta^{-1}y_{t-1} - x_{t-1}$ is an optimal predictor of $y_t$. Since

$$E(y_t|\Phi_{t-1}^*) = (1 + r)y_{t-1} - x_{t-1}$$

we also have

$$E(((y_t - y_{t-1} + x_{t-1})/y_{t-1})|\Phi_{t-1}^*) - r = 0$$

i.e. that excess expected returns are zero. In terms of $S_t$ and $\Delta x_t$, this can be
written as

$$E(S_t - \delta^{-1}S_{t-1} + \theta\Delta x_t|\Phi_{t-1}^*) = 0$$

or

$$E(S_t|\Phi_{t-1}^*) = \delta^{-1}S_{t-1} - \theta E(\Delta x_t|\Phi_{t-1}^*) \qquad (10.11)$$

Using the VAR formulation (10.5), we have

$$E(S_t - \delta^{-1}S_{t-1} + \theta\Delta x_t|\Phi_{t-1}^*) = (a_1 + \theta c_1 - \delta^{-1})S_{t-1} + \sum_{i=2}^{m}(a_i + \theta c_i)S_{t-i} + \sum_{i=1}^{m}(b_i + \theta d_i)\Delta x_{t-i}$$

which is identically equal to zero under the restrictions (10.9).

A further implication of the present value model for the VAR (10.5) is that $S_t$
must Granger-cause $\Delta x_t$ unless $S_t$ is itself an exact linear function of $\{x_t^0\}$. This
is because $S_t$ is an optimal forecast of a weighted sum of future values of $\Delta x_t$
conditional on $\Phi_t$ (recall equation (10.4)). $S_t$ will therefore have incremental
explanatory power for future $\Delta x_t$ if agents have information useful for
forecasting $\Delta x_t$ beyond $\{x_t^0\}$; if not, they form $S_t$ as an exact linear function of $\{x_t^0\}$.

Following Campbell and Shiller (1987), we can also use these restrictions
to construct 'volatility tests' of the model. If the 'theoretical spread', $S_t^*$, is
defined as

$$S_t^* = \theta\sum_{i=1}^{\infty}\delta^i E(\Delta x_{t+i}|\Phi_t^*) = \theta h'\delta\Pi(I - \delta\Pi)^{-1}z_t$$

then, if the present value model is correct, we have from (10.4) that $S_t^* = S_t$,
and hence $V(S_t^*) = V(S_t)$. This equality provides a way of assessing the
model informally by examining the comovement of $V(S_t^*)$ and $V(S_t)$. In
particular, if the model is correct, the ratio $V(S_t)/V(S_t^*)$ should differ from
unity only because of sampling error in the estimated coefficients of the VAR.
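A minimal sketch of the levels variance ratio follows. It assumes that a companion matrix `pi` is already available (in practice it would be estimated from the VAR) and that the rows of `z` hold the stacked vectors $z_t$ with $S_t$ in the first column. The numerical values are illustrative and chosen so that the restrictions hold exactly, in which case the ratio is one.

```python
import numpy as np

def theoretical_spread(z, pi, delta, theta):
    """S*_t = theta h' delta Pi (I - delta Pi)^{-1} z_t for each row z_t of z."""
    k = pi.shape[0]
    h = np.zeros(k); h[k // 2] = 1.0
    w = theta * np.linalg.solve((np.eye(k) - delta * pi).T, (delta * pi).T @ h)
    return z @ w

def levels_variance_ratio(z, pi, delta, theta):
    """V(S_t) / V(S*_t), with S_t stored in the first column of z."""
    s_star = theoretical_spread(z, pi, delta, theta)
    return np.var(z[:, 0]) / np.var(s_star)

# Illustrative VAR(1) whose coefficients satisfy the present value restrictions
delta = 0.95
theta = delta / (1.0 - delta)
c1, d1 = 0.01, 0.3
pi = np.array([[(1.0 - theta * delta * c1) / delta, -theta * d1],
               [c1, d1]])
z = np.random.default_rng(0).normal(size=(200, 2))
ratio = levels_variance_ratio(z, pi, delta, theta)
```

With an estimated `pi` the ratio will generally differ from one, and the size of the departure is what the informal volatility test looks at.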

Campbell and Shiller (1987) also suggest a second volatility test in
addition to this 'levels variance ratio'. Denoting the innovation associated with
(10.11) as

$$\xi_t = S_t - \delta^{-1}S_{t-1} + \theta\Delta x_t$$

the 'theoretical innovation' can be defined analogously as

$$\xi_t^* = S_t^* - \delta^{-1}S_{t-1}^* + \theta\Delta x_t$$

Under the present value model, $\xi_t^* = \xi_t$, as $S_t^* = S_t$, so the 'innovation
variance ratio', $V(\xi_t)/V(\xi_t^*)$, should again be compared with unity.

The interpretation of (10.1) as the present value of a stock price given the
future dividend stream relies on $y_t$ and $x_t$ being the levels of prices and
dividends, respectively. If dividends grow at a constant rate g < r then

$$E(x_{t+i}|\Phi_t) = (1 + g)^i x_t$$

and

$$E(\Delta x_{t+i}|\Phi_t) = (1 + g)^{i-1}g\,x_t$$

so that (10.4) becomes

$$S_t = \frac{g}{r(r - g)}x_t \qquad (10.12)$$

which is clearly no longer I(0). Since (10.12) implies, however, that

$$y_t = \frac{(1 + g)}{(r - g)}x_{t-1} = \frac{1}{(r - g)}x_t \qquad (10.13)$$

the 'full spread',

$$Sf_t = y_t - \frac{(1 + g)}{(r - g)}x_{t-1} = y_t - \frac{1}{(r - g)}x_t$$

will be I(0). Equation (10.13) can be written as

$$\log(y_t) - \log(x_{t-1}) = \log(\kappa)$$

where $\kappa = (1 + g)/(r - g)$, so that, when expressed in logarithms, prices
and dividends are cointegrated with a unit-cointegrating parameter. This
representation leads to the 'dividend ratio' form of the model, to be
developed in section 10.1.3.
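The constant-growth case can be verified by brute-force summation. The sketch below assumes that dividends are discounted as $\delta^{i+1}E(x_{t+i}|\Phi_t)$ with $\delta = 1/(1+r)$, which is the timing implied by the forward solution of the first-order model; the parameter values and function name are illustrative.

```python
def present_value_constant_growth(x, r, g, horizon=5000):
    """Truncated sum of delta^(i+1) * (1+g)^i * x, with delta = 1/(1+r) and g < r."""
    delta = 1.0 / (1.0 + r)
    return sum(delta ** (i + 1) * (1.0 + g) ** i * x for i in range(horizon))

x, r, g = 2.0, 0.08, 0.03
price = present_value_constant_growth(x, r, g)   # close to x / (r - g)
spread = price - x / r                           # y_t - theta * x_t with theta = 1/r
```

Under these assumptions the price is proportional to the current dividend, so the spread $y_t - \theta x_t$ inherits the growth in $x_t$ and cannot be stationary.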


10.1.2 Rational bubbles 


These tests of the present value model have all been based on the assumption
that the transversality condition in equation (10.3) holds, i.e. that

$$\lim_{n\to\infty}\delta^n E(y_{t+n}|\Phi_t) = 0$$

If this is the case then $y_t = y_t^f$, where $y_t^f$ is the unique forward solution, often
termed the 'market fundamentals' solution,

$$y_t^f = \sum_{i=0}^{\infty}\delta^{i+1}E(x_{t+i}|\Phi_t)$$

If this transversality condition fails to hold, however, there will be a family of
solutions to (10.2): see, for example, Blanchard and Watson (1982), West
(1987) and Diba and Grossman (1987, 1988). In such circumstances, any $y_t$
that satisfies

$$y_t = y_t^f + B_t$$

where

$$E(B_{t+1}|\Phi_t) = \delta^{-1}B_t = (1 + r)B_t \qquad (10.14)$$

is also a solution. $B_t$ is known as a speculative, or rational, bubble, an otherwise
extraneous event that affects $y_t$ because everyone expects it to do so, i.e. it is
a self-fulfilling expectation.

An example of such a bubble is (see Blanchard and Watson, 1982, and
West, 1987)

$$B_t = \begin{cases} (B_{t-1} - B)/(\pi\delta) & \text{with probability } \pi \\ B/((1 - \pi)\delta) & \text{with probability } 1 - \pi \end{cases} \qquad (10.15)$$

where $0 < \pi < 1$ and B > 0 (other examples are provided by, for example,
Hamilton, 1986). According to (10.15), strictly positive bubbles grow and
burst, with the probability that the bubble bursts being $1 - \pi$. While the
bubble floats it grows at the rate $(\pi\delta)^{-1} = (1 + r)/\pi > 1 + r$; investors in the
asset thus receive an extraordinary return to compensate them for the capital
loss that would have occurred had the bubble burst.
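A small simulation sketch of this two-state bubble is given below. The parameter values, the seed and the function names are illustrative assumptions. The first function confirms that the process satisfies the bubble condition $E(B_t|B_{t-1}) = \delta^{-1}B_{t-1}$.

```python
import numpy as np

def bubble_conditional_mean(b_prev, b_const, pi, delta):
    """E(B_t | B_{t-1}) implied by the two-state bubble."""
    survive = (b_prev - b_const) / (pi * delta)      # bubble keeps floating
    restart = b_const / ((1.0 - pi) * delta)         # bubble bursts and restarts
    return pi * survive + (1.0 - pi) * restart

def simulate_bubble(b0, b_const, pi, delta, steps, seed=0):
    """Simulate one path of the bubble, starting from b0."""
    rng = np.random.default_rng(seed)
    path = [b0]
    for _ in range(steps):
        if rng.random() < pi:
            path.append((path[-1] - b_const) / (pi * delta))
        else:
            path.append(b_const / ((1.0 - pi) * delta))
    return np.array(path)

path = simulate_bubble(b0=1.0, b_const=0.1, pi=0.9, delta=0.95, steps=100)
```

Plotting such a path shows long stretches of faster-than-(1 + r) growth interrupted by collapses back to a small positive level.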

Equation (10.14) implies that the rational bubble has explosive
conditional expectations, since

$$E(B_{t+i}|\Phi_t) = (1 + r)^i B_t$$

and r > 0. Thus, if $y_t$ is the price of a freely disposable asset, say a stock, then a
negative rational bubble ($B_t < 0$) cannot exist, because its existence would
imply that $y_t$ decreases without bound at the geometric rate (1 + r), so that it
becomes negative at some finite time t + i. Negative rational bubbles are, at
least theoretically, possible if $y_t$ is an exchange rate, for this characterises a
continual currency appreciation.

While positive bubbles are theoretically possible, Diba and Grossman 
(1987, 1988) discuss a number of conditions that must be met for their 
existence. Positive bubbles imply that asset holders might expect such a 
bubble to come to dominate $y_t$, which would then bear little relation to
market fundamentals. Bubbles would be empirically plausible only if, despite 
explosive conditional expectations, the probability is small that a rational 
bubble becomes arbitrarily large. Moreover, for exchange rates a positive 
bubble would imply a continual devaluation of the currency, and this can be 
ruled out by an argument symmetric to that used above for a negative 
rational bubble in stock prices. 

Diba and Grossman also show that, if a rational bubble does not exist at 
time t, then it cannot get started at any later date t+ i, i> 0, and that, if an 
existing rational bubble bursts, a new independent rational bubble cannot 
simultaneously start. Thus, if a rational bubble exists at time t, it must have 
started at time t=0 (the first date of trading of the asset), it must not have 
burst, it will not restart if it bursts and, if it is a bubble in a stock price, the 
stock has been continuously overvalued relative to market fundamentals. 

The presence of bubbles can be tested by examining their implications for
cointegration between various series. When $y_t = y_t^f$, so that no bubbles are
present, equation (10.13) implies that

$$U_{t+1} = y_t - \delta(y_{t+1} + x_t)$$

must be I(0), and, as we have already shown, the spread $S_t = y_t - \theta x_t$ must
also be I(0), so that $y_t$ must be cointegrated with both $x_t$ and $y_{t+1} + x_t$ (it
must also be the case that $\Delta y_t$ is I(0)). If, on the other hand, a bubble is
present, so that $y_t = y_t^f + B_t$, the bubble must appear in both $U_t$ and $S_t$. Since,
by definition, $B_t$ is non-stationary, these variables cannot be I(0) and the
cointegration relationships cannot hold.

Hamilton and Whiteman (1985) discuss these implications in more detail,
showing that, if $x_t \sim I(d)$, then rational bubbles can only exist if $y_t \sim I(d + b)$,
where b > 0. The finding that $y_t$ is of a higher order of integration than $x_t$ is
not necessarily evidence in favour of bubbles, however. As Hamilton and 
Whiteman point out, such a finding might be explained by numerous other 
factors: what appears to be a bubble could have arisen instead from rational 
agents responding solely to fundamentals not observed by the modeller. 

One further important drawback with tests of stationarity and coin- 
tegration is the question of power. Diba (1990), for example, argues that, if B 
in (10.15) is sufficiently close to zero, the ensuing bubble would generate 
fluctuations in a finite sample that could not be distinguished from sta- 
tionary behaviour. Meese (1986) provides both simulation and empirical 
evidence on exchange rate bubbles that is consistent with this. 


Example 10.1 Testing stock market volatility 
Campbell and Shiller (1987) employ the 'cointegration approach' to test the
present value model for annual data on the real S&P Composite price index
($y_t$) and the associated dividend index ($x_t$) from 1871 to 1986. As a
preliminary, unit root tests are needed to ensure that both $y_t$ and $x_t$ are I(1); this
was indeed found to be the case. Less conclusive evidence was presented that
the spread was stationary, which would imply that $y_t$ and $x_t$ are cointegrated
(in practice, Campbell and Shiller use $SL_t = y_t - \theta x_{t-1}$ rather than $S_t$ to
avoid timing problems caused by the use of beginning-of-year stock prices
and dividends paid within the year).

Nonetheless, assuming cointegration and using the cointegrating
regression estimate of $\theta$ (an implied discount rate of 3.2 per cent), a second-order
VAR was constructed for the bivariate ($SL_t$, $\Delta x_t$) process. The estimates
suggested that dividend changes were highly predictable, and there was
strong evidence that the spread Granger-caused dividend changes, one
implication of the present value model. The restrictions (10.9) could not be
rejected at conventional significance levels, and neither were the two variance
ratios significantly larger than unity.
Markedly different results were obtained, however, when the sample mean
return was used to calculate a discount rate of 10.2 per cent. Now the
restrictions (10.9) could be rejected at low significance levels and the two
variance inequalities were sharply violated. Campbell and Shiller suggest that
the implied discount rate of 3.2 per cent obtained from the cointegrating
regression may be too low, which might be argued to be consistent with the
proposition that the cointegration parameter is estimating the discount rate
$\kappa^{-1}$ rather than r. Nevertheless, although they prefer to use the higher
discount rate of 10.2 per cent, which implies a 4.8 per cent growth in dividends
and leads to excessive volatility, they do emphasise that the strength of the
evidence depends sensitively on the assumed value of the discount rate.

Updating Mills (1993), we apply this technique to UK data on real stock 
prices and dividends, obtained by dividing the FTA All Share price and 
dividend series used in previous examples by the retail price index. The series 
are shown for the period January 1965 to December 2005 in figure 10.1. 
Following the Dickey and Pantula (1987) approach of chapter 3, section 5, to 
testing for more than one unit root, we confirm that both series are I(1), thus 
ruling out the presence of rational bubbles. 

Are the two series cointegrated? A unit root test on the residuals from the
cointegrating regression of real prices on real dividends yields the statistic
C = -3.98, which is significant at the 5 per cent level, although, on
estimating a VECM, the trace statistic is only $\eta_0 = 11.9$, which is significant at
the 16 per cent level. Given this, albeit somewhat weak, evidence in favour of
cointegration, we proceed by assuming that the series are cointegrated. The
estimates of the cointegration parameter $\theta$ from the two approaches are
37.55 and 38.89, respectively, implying respective discount rates of 2.7 per
cent and 2.6 per cent. Since the two estimates are so close to each other, we
continue the analysis using just the former.
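The implied discount rates follow from the relation $\theta = 1/r$ stated earlier for the spread. As a quick arithmetic check (the function name is illustrative):

```python
def implied_discount_rate(theta):
    """r = 1 / theta, from theta = delta / (1 - delta) with delta = 1 / (1 + r)."""
    return 1.0 / theta

# Percentage discount rates implied by the two estimates of theta
rates = [round(100 * implied_discount_rate(t), 1) for t in (37.55, 38.89)]
```

The two values round to 2.7 and 2.6 per cent, matching the figures quoted above.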

Fitting a fourth-order VAR of the form (10.5) leads to a Wald test statistic
of the restrictions (10.9) taking the value $\chi^2 = 17.4$, which is clearly
significant, thus rejecting the present value restrictions. A test of $S_t$
Granger-causing $\Delta x_t$, however, has a p-value of only 0.01, which supports the present
value model implication. The variance ratio inequalities are also violated:

$$V(S_t)/V(S_t^*) = 3.6 \quad \text{and} \quad V(\xi_t)/V(\xi_t^*) = 0.3.$$


Example 10.2 Testing the expectations hypothesis of the term structure of interest rates
Shiller (1979) shows that the expectations hypothesis of the term structure of
interest rates (that the current long rate is the weighted average of current and
expected future short rates) can be put into the form of the present value
model (10.1). In this framework, $y_t$ is the current interest rate (the yield to
maturity) on a long bond (strictly, a perpetuity), $x_t$ is the current one-period
interest rate, $\theta$ is set to unity, $\delta$ is a parameter of linearisation, typically set
equal to $(1 + \bar{y})^{-1}$, and c is a liquidity premium unrestricted by the model.

The expectations hypothesis thus asserts that, if $y_t$ and $x_t$ are both I(1),
then the spread, $S_t = y_t - x_t$ (noting that $\theta = 1$), must be I(0) and hence that
$y_t$ and $x_t$ must be cointegrated with cointegrating vector (1, -1). $S_t$ and $\Delta x_t$
then have the VAR representation (10.5) and the expectations hypothesis
implies the restrictions given by equation (10.9), although the first of these
can now be written as $a_1 + c_1 = 1 + \bar{y}$. Equation (10.10) now has the
implication that the excess return on holding a long bond for one period,
rather than a one-period bond, should be unpredictable.

Although this form of the expectations hypothesis is strictly valid only when 
the long rate is a perpetuity, it can still be used for bonds of finite, but very long, 
life — e.g. twenty years. Campbell and Shiller (1987) thus test the model using 
monthly data on the yield on US Treasury twenty-year bonds and one-month 
Treasury bill rates for the period 1959 to 1983. Evidence is presented that the 
spread is stationary, but a test of the restrictions (10.9) rejects the expectations 
hypothesis very strongly. Nonetheless, the variance ratios are not significantly 
different from unity and the 'theoretical spread', $S_t^*$, is highly correlated with
the actual spread, $S_t$. Campbell and Shiller interpret these conflicting findings as
evidence that deviations from the present value model are only transitory and 
suggest that the model does, in fact, fit the data comparatively well. 
Updating Mills (1991b), we consider here how the expectations hypothesis
fits the UK interest rate data used in many previous examples. Because the
short rate is the yield on ninety-one-day (three-month) Treasury bills, we use
quarterly, rather than monthly, data for the period from the start of 1952 to
the end of 2005, a total of T = 216 observations. The spread, shown in figure
10.2, is stationary, with an ADF test producing $\tau_\mu = -3.97$, which rejects a
unit root at the 1 per cent level (note that we can carry out a unit root test
directly on the spread, rather than test for cointegration between $y_t$ and $x_t$,
because the cointegration parameter is assumed to be $\theta = 1$ a priori).

A VAR(1) was then fitted to $S_t$ and $\Delta x_t$, and imposing the single restriction
$a_1 + c_1 = 1.072$ in (10.9) leads to the test statistic $\chi_1^2 = 3.70$, which is not
quite significant at the 5 per cent level. $S_t$ Granger-causes $\Delta x_t$ at the 0.01 per
cent level and the variance ratio $V(S_t)/V(S_t^*)$ is just 0.9. The theoretical
spread $S_t^*$ is very close to the actual spread, so that the evidence in favour of
using the present value model to analyse the setting of interest rates is
surprisingly strong.


10.1.3 The ‘dividend ratio model’: a log-linear approximation to the present 
value model 


As has been remarked on above, the present value model (10.1) when applied
to stock prices is specified in terms of the levels of prices and dividends, and
this may present statistical problems if these series grow exponentially. To
incorporate such non-stationarity, Campbell and Shiller (1988c) focus
attention on the logarithmic return. Recall the definition of the one-period
return as, in this case,

$$r_{1,t+1} = (P_{t+1} + D_t - P_t)/P_t = (P_{t+1} + D_t)/P_t - 1$$

Taking logarithms and using the approximation $r_{1,t+1} \approx \log(1 + r_{1,t+1}) = h_{1,t+1}$ yields

$$h_{1,t+1} = \log(P_{t+1} + D_t) - \log(P_t)$$

Campbell and Shiller examine the relationship between $h_{1,t+1}$ and the
logarithms of dividends and prices, $d_t$ and $p_t$. The relationship is non-linear,
of course, but can be approximated as

$$h_{1,t+1} = k + \rho p_{t+1} + (1 - \rho)d_t - p_t = k + \zeta_t - \rho\zeta_{t+1} + \Delta d_t \qquad (10.16)$$

where $\zeta_t = d_{t-1} - p_t$ is the logarithmic 'dividend/price ratio' or dividend
yield, $\rho$ is the average of the ratio $P_t/(P_t + D_{t-1})$ and
$k = -\log(\rho) - (1 - \rho)\log(1/\rho - 1)$: see Campbell and Shiller (1988b, 1988c) for details of the
derivation of equation (10.16).
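The quality of the log-linear approximation can be gauged with a small numerical comparison. The sketch below treats $\rho$ as a fixed linearisation point rather than a sample average, and the price and dividend levels are hypothetical values chosen for illustration.

```python
import math

def loglinear_constant(rho):
    """k = -log(rho) - (1 - rho) * log(1/rho - 1)."""
    return -math.log(rho) - (1.0 - rho) * math.log(1.0 / rho - 1.0)

def exact_log_return(p_next, d, p):
    """h = log(P_{t+1} + D_t) - log(P_t), with arguments in levels."""
    return math.log(p_next + d) - math.log(p)

def approx_log_return(p_next, d, p, rho):
    """k + rho * p_{t+1} + (1 - rho) * d_t - p_t, with lower-case logs formed inside."""
    k = loglinear_constant(rho)
    return k + rho * math.log(p_next) + (1.0 - rho) * math.log(d) - math.log(p)

rho = 100.0 / 104.0                       # P / (P + D) at the linearisation point
gap_at_point = exact_log_return(100.0, 4.0, 98.0) - approx_log_return(100.0, 4.0, 98.0, rho)
gap_nearby = exact_log_return(105.0, 4.0, 98.0) - approx_log_return(105.0, 4.0, 98.0, rho)
```

The approximation is exact when the price/dividend ratio equals the value used to set $\rho$ and deteriorates only slowly as the ratio moves away from it.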

Equation (10.16) can be thought of as a difference equation relating $\zeta_t$ to
$\zeta_{t+1}$, $\Delta d_t$ and $h_{1,t+1}$, and, on solving forwards and imposing the terminal
condition that $\lim_{i\to\infty}\rho^i\zeta_{t+i} = 0$, we obtain

$$\zeta_t = \sum_{i=0}^{\infty}\rho^i(h_{1,t+i+1} - \Delta d_{t+i}) - \frac{k}{1 - \rho} \qquad (10.17)$$

As it stands, this equation has no economic content, since it simply says that
$\zeta_t$, the log dividend/price ratio, can be written as a discounted value of the
differences between future returns and dividend growth rates discounted at
the constant rate $\rho$, less a constant $k/(1 - \rho)$. Suppose, however, that, as
before, expected one-period returns are constant: $E(h_{1,t+1}|\Phi_t) = r$. Then,
on taking conditional expectations of (10.17), we obtain

$$\zeta_t = -\sum_{i=0}^{\infty}\rho^i E(\Delta d_{t+i}|\Phi_t) + \frac{r - k}{1 - \rho} \qquad (10.18)$$

which expresses the log dividend/price ratio as a linear function of expected
real dividend growth into the infinite future.

The restrictions implicit in (10.18) can be tested using a framework
analogous to that developed in section 10.1.2 above, noting that, in this
context, $\zeta_t$ is the logarithmic counterpart of the spread $S_t = P_t - \theta D_t$. We
therefore consider $\zeta_t$ and $\Delta d_t$ to be generated by a VAR, which can be written
in companion form as in equation (10.6) with $z_t = (\zeta_t, \Delta d_t)'$. The implied
solution to the present value model conditional on the restricted
information set $\Phi_t^* = (\zeta_t^0, \Delta d_t^0)$ is then

$$g'z_t = -h'\Pi(I - \rho\Pi)^{-1}z_t$$

with the accompanying set of restrictions

$$g' + h'\Pi(I - \rho\Pi)^{-1} = 0 \qquad (10.19)$$

As with the analogous set (10.9), these restrictions imply that $E(h_{1,t+1}|\Phi_t^*) = 0$,
so that returns are unpredictable. Moreover, as with the VAR of (10.6), a further
implication of this model is that $\zeta_t$ should Granger-cause $\Delta d_t$.

Campbell and Shiller (1988c) argue that working with logarithms has cer- 
tain advantages over the approach developed previously when testing the 
implications of the present value model for stock prices. One advantage is that 
it is easy to combine with individual log-linear models of prices and dividends, 
which, as stressed by Kleidon (1986b), for example, are both more appealing 
on theoretical grounds and do appear to fit the data better than linear ones. 
A second advantage is that using the variables $\zeta_t$ and $\Delta d_t$ mitigates
measurement error problems that may occur when deflating nominal stock prices
and dividends by some price index to obtain real variables. 

The model has been extended in various ways. Campbell and Shiller
(1988c) allow expected log returns to be given by the model
$E(h_{1,t+1}|\Phi_t) = r + R_t$, where $R_t$ is the real return on, for example, Treasury
bills. In this case $R_{t+i} - \Delta d_{t+i}$ replaces $-\Delta d_{t+i}$ in equation (10.18) and
$z_t = (\zeta_t, R_t - \Delta d_t)'$ becomes the vector modelled as a VAR. Campbell and
Shiller (1988b) focus attention on the j-period discounted return

$$h_{j,t} = \sum_{i=0}^{j-1}\rho^i h_{1,t+i}$$

which leads to the following set of restrictions on the VAR:

$$g'(I - \rho^j\Pi^j) + h'\Pi(I - \rho\Pi)^{-1}(I - \rho^j\Pi^j) = 0$$

Although these restrictions are algebraically equivalent to those of (10.19) for
all j, reflecting the fact that, if one-period returns are unpredictable, then
j-period returns must also be, and vice versa, Wald tests may yield different
results depending on which value of j is chosen. Nevertheless, the VAR
framework confers yet another advantage in this set-up: it needs to be
estimated only once, as tests can be conducted for any j without re-estimating
the system.

Campbell and Shiller (1988b) also extend the VAR framework to 
incorporate a third variable, a long moving average of the earnings/price 
ratio, which is included as a potential predictor of stock returns. Campbell 
(1991), on the other hand, uses the model to analyse the unexpected com- 
ponent of returns, while Campbell and Shiller (1988a) concentrate on using 
the model to reinterpret the Marsh and Merton (1987) error correction 
model of dividend behaviour in the context of a ‘near-rational expectations’ 
model in which dividends are persistent and prices are disturbed by per- 
sistent random noise. 

Campbell and Shiller (1988b, 1988c) apply the dividend ratio model to 
various data sets, including an updated Standard and Poor’s. They find that 
the restrictions of the model tend to be rejected by the data and that the 
earnings variable is a powerful predictor of stock returns, particularly when 
returns are calculated over several years. 


Example 10.3 The dividend ratio model for UK equity prices 

This model was applied to the UK data analysed in example 10.1. As a
prerequisite, we require that $\zeta_t$ and $\Delta d_t$ are stationary. Example 3.1 has
shown that the presence of a unit root in the levels of the dividend yield can
be rejected, and a similar result occurs here for the logarithms: an ADF test
rejects a unit root at the 5 per cent level. That $\Delta d_t$ is stationary has been
reported in example 6.5. On fitting a VAR(3) to $z_t = (\zeta_t, \Delta d_t)'$, we find that
$\zeta_t$ does Granger-cause $\Delta d_t$, the marginal significance level of the test being
less than 0.001.

A Wald test of the restrictions (10.19) is equivalent to a test that the
coefficients in the regression of $h_{1,t}$ on lags of $\zeta_t$ and $\Delta d_t$ are all zero, i.e. that
returns are unforecastable. Since the $R^2$ from this regression is around 0.9, it is
clear that the dividend ratio model for UK equity prices is conclusively rejected.


10.2 Generalisations and extensions of cointegration and error 
correction models 


10.2.1. Non-linear generalisations 


Given the tendency for financial time series to contain important non- 
linearities, it comes as no surprise that several models have been developed 
that generalise cointegration and error correction models in non-linear
directions. These have taken two general forms: a linear cointegrating vector 
has been allowed to enter as a non-linear error correction, and the coin- 
tegrating relationship itself has been allowed to be non-linear. For an 
overview of the issues involved in this context, see Granger and Terasvirta 
(1993). 

Granger and Swanson (1996) discuss instructive examples of non-linear
cointegration and error correction. Suppose we are modelling the I(1)
process $z_t = (y_t, x_t)'$ and there is the single cointegrating vector $e_t = \alpha' z_t$.
Granger and Swanson suggest the simple non-linear error correction
model in which, rather than $e_{t-1}$ appearing, it is replaced by $e^+_{t-1}$ and $e^-_{t-1}$,
defined as

$$
e^+_{t-1} = \begin{cases} e_{t-1} & \text{if } e_{t-1} \ge 0 \\ 0 & \text{otherwise} \end{cases}
\qquad
e^-_{t-1} = e_{t-1} - e^+_{t-1}
$$
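
The split of the lagged equilibrium error into its positive and negative parts can be sketched in a few lines of Python. This is only an illustration: it assumes the negative part is defined as the remainder after the positive part is removed, and the function name and the numerical errors are hypothetical, not taken from Granger and Swanson.

```python
def split_error(e):
    """Return (e_plus, e_minus): e_plus keeps non-negative errors,
    e_minus is the remainder, so that e_plus + e_minus == e."""
    e_plus = e if e >= 0 else 0.0
    return e_plus, e - e_plus

# Hypothetical lagged equilibrium errors, for illustration only.
errors = [0.8, -0.3, 0.0, -1.2]
parts = [split_error(e) for e in errors]
e_plus = [p for p, _ in parts]
e_minus = [m for _, m in parts]
```

Entering `e_plus` and `e_minus` as separate regressors, each with its own coefficient, allows the speed of adjustment to differ according to the sign of the disequilibrium.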


A rather less ‘ad hoc’ generalisation follows from the interpretation of
$\alpha' z_t = 0$ as an attractor or equilibrium, so that $e_t$ is a measure of the extent to
which the system is out of equilibrium. Thus, if the market ‘prefers’ $e_t$ to be
small, there must be costs associated with having non-zero values of $e_t$. The
traditional approach is then to assume a quadratic cost function, in which 
case the linear error correction model is obtained (see Nickell, 1985). If the 
cost function is non-linear, perhaps because of transaction costs that prevent 
profitable arbitrage for small deviations from equilibrium, or because het- 
erogeneity among arbitrageurs leads to differing transaction costs, then a 
non-linear error correction results, as, for example, 


$$
g(\delta e_{t-1}) = \left(1 + \exp(-\delta e_{t-1})\right)^{-1} - \tfrac{1}{2}
$$


A variety of theoretical justifications have been suggested to explain why 
non-linear error correction mechanisms should emerge. These include arbi- 
trage in the presence of transaction costs, heterogeneity among arbitrageurs, 
agents’ maximising or minimising behaviour, constraints on central bank 
intervention, and intertemporal choice behaviour under asymmetric adjust- 
ment costs. In terms of modelling, non-linear error correction mechanisms 
can be accommodated rather straightforwardly, in that residuals from some 
linear cointegration relationship can be incorporated in a non-linear error 
correction model. A simple way of capturing non-linear error correction 
relationships is by including asymmetric or higher-order polynomial error 
correction terms. 
Threshold-type models have also been used in modelling non-linear error
correction models, in which only large errors from equilibrium, above some 
threshold, are corrected. For example, Balke and Fomby (1997) use a three- 
regime model of the cointegrating relationship with the process following a 
unit root in the middle regime. Granger and Teräsvirta (1993) extend the
single threshold to include a smooth transition in the adjustment towards 
equilibrium, and this approach has become quite popular. Psaradakis, Sola 
and Spagnola (2004) employ a non-linear error correction model that fol- 
lows a Markov switching process, where deviations from equilibrium are 
non-stationary in one state and mean-reverting in the other. 
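
The threshold idea can be made concrete with a small sketch of a three-regime adjustment rule. It assumes a symmetric band and a 'return to the edge of the band' form of correction, which is only one of several possible variants; the function names, threshold and adjustment speed are hypothetical and are not taken from Balke and Fomby.

```python
def band_correction(e_lag, threshold, speed):
    """Adjustment implied by the lagged equilibrium error e_lag.

    Inside the band |e_lag| <= threshold there is no correction, so the
    error behaves like a unit root process; outside it, the part of the
    error beyond the band is pulled back at rate `speed`.
    """
    if e_lag > threshold:
        return -speed * (e_lag - threshold)
    if e_lag < -threshold:
        return -speed * (e_lag + threshold)
    return 0.0

def simulate_error(shocks, threshold=1.0, speed=0.5, e0=0.0):
    """Path of the equilibrium error driven by the given shocks."""
    path = [e0]
    for u in shocks:
        e_lag = path[-1]
        path.append(e_lag + band_correction(e_lag, threshold, speed) + u)
    return path

# With no further shocks, a large initial error decays towards the band edge.
path = simulate_error([0.0, 0.0, 0.0], e0=3.0)
```

Replacing the abrupt switch at the threshold with a smooth function of the lagged error gives the smooth transition variant mentioned above.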

Error correction models can also be estimated using non-parametric 
approaches, such as neural networks (e.g. Haefke and Helmenstein, 1996). 
Rather than concentrating on the conditional mean, Lee (1994) uses a lagged 
error correction term to model the conditional variance of a series, terming 
this a GARCH-X process. This approach seems useful for situations where 
disequilibrium shocks to a cointegrated system are likely to cause increases in 
the variance — for example, the relationship between spot and futures prices. 

Several procedures have been proposed for evaluating the null hypothesis of 
a unit root against a threshold autoregressive-type alternative. Michael, Nobay 
and Peel (1997) test the null hypothesis of a unit root in real exchange rates 
against a STAR error correction process when analysing the long-run pur- 
chasing power parity (PPP) relationship. Kapetanios, Shin and Snell (2006) 
have recently developed tests for the null of no cointegration that have power 
when the cointegrating error follows a stationary smooth transition process. 
One test is based on the cointegrating residuals, the other evaluates the null 
hypothesis by examining the significance of the parameter that controls the 
non-linearity in the speed of the error correction adjustment. 

Non-linear cointegration and, in general, non-linear relationships 
between non-stationary variables bring about several econometric problems 
that appear to be different from those associated with non-linear error 
correction. At the theoretical level, some of the issues related to the 
asymptotic behaviour of non-linear transformations of non-stationary time
series are addressed in Park and Phillips (1999, 2001), where an asymptotic 
theory for non-linear regression with integrated processes is developed. For a 
broad family of bivariate non-linear regression functions, sufficient condi- 
tions for weak consistency, rates of convergence and limit distributions are 
obtained. When the regression functions are integrable and the errors are 
martingale differences, the estimators are shown to be mixed normal and 
standard non-linear inference is applicable. 
At both the conceptual and modelling level, Granger and Hallman (1991)
first consider the possibility of non-linear cointegration as a bivariate
‘attractor’ between variables that are individually EMM but have an SMM
non-linear combination. They suggest that non-linear equilibrium rela- 
tionships could emerge between prices of commodities traded at spatially 
separated markets due to the existence of varying marginal costs and profits. 
Granger and Hallman employ the alternating conditional expectations 
(ACE) algorithm to estimate bivariate non-parametric cointegrating 
regressions, and offer some simulation evidence regarding the behaviour of 
Engle and Granger’s (1987) cointegration tests using ACE residuals. Creedy, 
Lye and Martin (1996) use a non-linear cointegration relationship to 
derive an error correction model in continuous time, the dynamics of which 
follow a generalised exponential stationary distribution. The model is estimated
via ML and allows a rich variety of distributional shapes, exhibiting prop- 
erties such as leptokurtosis and multimodality. Creedy, Lye and Martin 
suggest that such models are useful in modelling switching behaviour between 
multiple equilibria, and discuss an empirical example using exchange rate 
data. 

Hall, Psaradakis and Sola (1997) model non-linear cointegration between 
consumption and disposable income as a regime-switching process where 
shifts follow an unobserved Markov chain with unknown transition prob- 
abilities. Pfann, Schotman and Tschernig (1996) demonstrate that univariate 
non-linearities in the short-term interest rate can produce non-linear 
cointegration relationships between long and short interest rates, using self- 
exciting threshold autoregressive (SETAR) models to demonstrate the 
empirical validity of their case. Basu and Dua (1996) show that non- 
homogeneous utility functions for cash and credit goods imply non-linear 
cointegration between income velocity, nominal interest rates and real GDP. 
Finally, Bollerslev and Engle (1993), among others, have applied the coin- 
tegration concept to modelling common persistence between conditional 
variances. 

Departures from the standard cointegration framework can also lead to 
time series behaviour that is consistent with non-linear cointegration. For 
example, non-linear equilibrium relationships may arise within the seasonal 
cointegration framework of Hylleberg et al. (1990) and the non-linear sto- 
chastic trend models of Granger, Inoue and Morin (1997). Siklos and 
Granger (1997) argue that cointegrating relationships may switch according 
to the monetary policy regime, and propose the concept of temporary
cointegration to allow variables to be cointegrated in one regime and
non-cointegrated in another, using this approach to test the hypothesis of
uncovered interest rate parity. Granger and Yoon (2002) consider the case of 
‘hidden’ cointegration, where cointegrated variables respond only to certain 
kinds of shocks, say positive or negative. They argue that this may be 
applicable, for example, when modelling the response of central banks, 
which may be more interested in rising than falling interest rates. A 
‘crouching’ error correction model is proposed to model variables related 
through hidden cointegration. 

An innovative regime-switching non-linear cointegration process has been 
developed by Granger and Hyung (2006) using ‘m-m’ models. In this context,
two variables, x and y, vary according to a switching regime process that
allows mixed integration and cointegration. In each step, a max or min
operator is used to choose between integration (e.g. $x_{t+1} = x_t + \varepsilon_t$) or
cointegration (e.g. $x_{t+1} = b y_t + \varepsilon_t$) for each variable. Although in simple
cases m-m processes imply linear cointegrating relationships, they always 
have threshold-type non-linear error correction representations. 
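
A stylised simulation may help to fix ideas. The sketch below assumes one particular way of combining the two candidates (a max for x and a min for y, with the shock added afterwards); this form, the coefficients b and c and the function names are illustrative assumptions rather than Granger and Hyung's exact specification.

```python
import random

def mm_step(x, y, b, c, eps_x, eps_y):
    """One step of a stylised max-min system.

    x takes the larger of the 'integration' candidate x and the
    'cointegration' candidate b*y; y takes the smaller of y and c*x.
    """
    x_next = max(x, b * y) + eps_x
    y_next = min(y, c * x) + eps_y
    return x_next, y_next

def simulate_mm(n, b=1.0, c=1.0, seed=0):
    """Simulate n steps with standard normal shocks from a seeded generator."""
    rng = random.Random(seed)
    x, y = 0.0, 0.0
    path = [(x, y)]
    for _ in range(n):
        x, y = mm_step(x, y, b, c, rng.gauss(0, 1), rng.gauss(0, 1))
        path.append((x, y))
    return path

path = simulate_mm(10)
```

At each date the operator decides which regime is active, so the implied error correction is switched on only when the cointegration candidate is selected, which is the threshold-type behaviour described above.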

Corradi, Swanson and White (2000) replace the concept of cointegration 
with the more general concept of linear stochastic comovement. This gen- 
eralisation allows for non-linear cointegration, and is defined as the condi- 
tion whereby linear combinations among the components of non-ergodic 
non-linear Markov processes produce ergodic non-linear Markov processes. 
Existing testing procedures for stationarity and cointegration are shown to 
be applicable in the linear stochastic comovement context. Moreover, 
Corradi, Swanson and White propose a consistent test for the null of a linear 
cointegration vector against general non-linear alternatives. 

A more general definition of cointegration can be based on a set of non- 
stationary variables that do not necessarily have equal orders of integration 
but can be expressed in a ‘more stationary’ linear or non-linear combination. 
Such definitions allow for cointegration between fractionally integrated 
variables and between integrated variables of order greater than one. In 
practical terms, fractional cointegration can be used to model slow error 
correction adjustment towards long-run equilibria. Abadir and Taylor 
(1999) show that linearity of the cointegrating regression requires that the 
variables have identical orders of integration. This makes linear cointegra- 
tion a nested and composite hypothesis, since it is not possible to specify the 
distribution theory for cointegration testing until a common order of inte- 
gration has been established. 

These problems can be avoided if the testing procedures allow for flexible 
functional forms and fractional unit roots. Cointegration between variables 
with unequal orders of integration requires that the cointegrating function is
non-null and, possibly, non-linear. Motivated by the possibilities of fractional
cointegration, a number of researchers have investigated whether deviations
from a cointegrating relationship follow a fractionally integrated process, 
and have devised relevant testing procedures. For example, Gil-Alana (2003) 
has proposed a simple two-step test of fractional cointegration in the spirit of 
Engle and Granger (1987). The procedure tests initially the order of inte- 
gration of the series and then examines the degree of integration of the 
residuals from the cointegrating regression. In both steps the univariate 
fractional unit root test of Robinson (1994) is employed and relevant critical 
values are derived for finite samples. In an empirical application, Gil-Alana 
analyses the data used by Engle and Granger (1987) and Campbell and Shiller 
(1987) and presents evidence of fractional cointegration between con- 
sumption and income, nominal GNP and money, and stock prices and 
dividends. 
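
The order of integration d examined in both steps enters through the fractional difference operator $(1 - B)^d$. A minimal sketch of the filter is given below, using the standard recursion for the binomial expansion; the function names are hypothetical, and the code illustrates only the filter, not Robinson's test statistic or Gil-Alana's procedure.

```python
def frac_diff_weights(d, n):
    """First n coefficients of (1 - B)**d: w_0 = 1, w_k = w_{k-1} * (k - 1 - d) / k."""
    w = [1.0]
    for k in range(1, n):
        w.append(w[-1] * (k - 1 - d) / k)
    return w

def frac_diff(series, d):
    """Apply the filter, truncated at the start of the sample."""
    w = frac_diff_weights(d, len(series))
    return [sum(w[k] * series[t - k] for k in range(t + 1))
            for t in range(len(series))]

# d = 1 reproduces ordinary first differences (the first value is kept as is).
first_diff = frac_diff([1.0, 3.0, 6.0], 1)
# 0 < d < 1 gives slowly decaying weights on all past observations.
half_weights = frac_diff_weights(0.5, 3)
```

Residuals from the cointegrating regression whose estimated d lies below that of the original series, but above zero, correspond to the slow error correction adjustment described earlier.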

Although standard linear cointegration tests will have some power against 
non-linear cointegration, a number of specialised testing procedures have 
been developed. Bierens (1997) suggests consistent cointegration tests that 
do not require strong assumptions about the data-generating process or the 
estimation of nuisance parameters. These tests can be considered as an 
extension of the Johansen (1995) testing procedure outlined in chapter 9, 
section 5.3, since they are also based on the ordered solutions of a generalised 
eigenvalue problem. Non-parametric testing procedures for non-linear 
cointegration have also been suggested by Breitung (2001), being based on 
the difference between the sequences of ranks. In an extension of the com- 
mon features concept, Anderson and Vahid (1998) have proposed a gener- 
alised method of moments test for common non-linear components of 
specified or unspecified form between stationary time series. Bierens (2000) 
employs a generalised eigenvalue procedure to test non-parametrically for a 
special common features case concerning the existence of non-linear 
co-trending, where there are common non-linear deterministic time trends. 
Breitung (2002) proposes a generalisation of a variance ratio type statistic, 
similar to the KPSS test, to test the cointegration rank as in the Johansen 
approach. 

Parametric tests of specific non-linear cointegrating relationships have 
also been considered. For example, Choi and Saikkonen (2004) discuss 
procedures for testing linearity in the cointegration relationship against 
alternatives from the smooth transition family of models. By employing a 
Taylor expansion of the transition function, the problem of unidentified 
nuisance parameters in the non-linear models under the null can be avoided.
The derived tests are simple to implement, as they are based on OLS
regression and $\chi^2$ limit distributions.
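
To see why a Taylor expansion helps, consider a logistic transition function with slope $\gamma$ and location $c$ applied to a regressor $x_t$ (symbols introduced here purely for illustration, not taken from Choi and Saikkonen). Under linearity $\gamma = 0$, and $c$ is then unidentified. Expanding around $\gamma = 0$ gives, to third order,

```latex
G(\gamma(x_t - c)) = \left(1 + \exp(-\gamma(x_t - c))\right)^{-1}
\approx \tfrac{1}{2} + \tfrac{1}{4}\,\gamma (x_t - c) - \tfrac{1}{48}\,\gamma^3 (x_t - c)^3
```

so that the non-linear term is approximated by a polynomial in $x_t$. Linearity then corresponds to zero coefficients on the higher-order polynomial terms in an auxiliary regression, a restriction that involves no unidentified parameters.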


Example 10.4 Non-linear cointegration in the S&P 500 dividend discount model

As we have seen in example 10.1, Campbell and Shiller (1987) provide less 
than conclusive evidence that the ‘spread’ between the real price and divi- 
dend indices for the S&P index was stationary. In this example we investigate 
the possibility of non-linear cointegration between the price and dividend 
indices over the period 1871 to 2002. 

The dividend yield and the scatterplot of prices and dividends are shown in 
figure 10.3. Although the dividend yield appears to exhibit mean reversion, it is 
not obviously stationary, especially over the more recent years of the sample 
period. At the same time, the scatterplot of prices and dividends suggests that 
the relationship is weaker at higher data values. Although most of the non- 
linearity disappears on taking logarithms, evidence in favour of linear coin- 
tegration is still not clear-cut, with the trace and maximum eigenvalue test 
statistics confirming that the null hypothesis of non-cointegration cannot be 
rejected at the 5 per cent level for all test configurations. 

Evidence provided by Gil-Alana (2003) suggests that the two series may be 
fractionally cointegrated between 1871 and 1986, which is the sample period
analysed by Campbell and Shiller (1987). We find similar evidence by using 
the Robinson (1995b) semiparametric fractional unit root estimator (see 
chapter 4, section 3.3), finding estimates of d for the logarithmic price and 
dividend series equal to 0.476 and 0.484, respectively. Since these estimates 
may be biased by short-run dynamics, however, we also estimated d within 
an ARFIMA(1,d,1) framework for both series. This approach produces 
estimates closer to unity with d = 1.046 and 1.061, respectively. Neverthe- 
less, using Robinson’s estimator and an ARFIMA(0,d,0) model, we find that 
the residuals from the linear cointegrating regression between the two series 
seem to be fractionally integrated with estimates of d equal to 0.414 and 
0.925, respectively. 

Figure 10.3 S&P dividend yield and scatterplot of prices and dividends (annual 1871–2002)

Since these results may be influenced by the assumption of a linear
cointegration regression, we considered the non-linear cointegration tests of
Bierens (1997, 2000) and Breitung (2002). These produced rather mixed
evidence of non-linear cointegration, so we took a pragmatic approach and
examined the actual predictive ability of various error correction
specifications. First, a linear error correction model for the logarithmic S&P 500
returns was estimated using OLS:


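$$
\Delta y_t = \underset{(0.016)}{0.041} - \underset{(0.052)}{0.116}\, z_{t-1} + \underset{(0.171)}{[\ldots]}\, \Delta y_{t-1} + \varepsilon_t + \underset{(0.130)}{[\ldots]}\, \varepsilon_{t-1}
$$

$$
R^2 = 0.0866
$$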


Here $z_{t-1}$ is the lagged residual from the cointegrating regression between
prices and dividends; standard errors are shown in parentheses. Incorporating
positive and negative error correction terms separately shows that there is some
asymmetry in the model since only negative residuals, $z^-_{t-1}$, turn out significant:


$$
\Delta y_t = [\ldots]\; \underset{(0.067)}{0.209}\, z^-_{t-1}\; [\ldots] \qquad R^2 = 0.0834
$$


By allowing for a fractional unit root in the error correction process, the 
following ARFIMA model was estimated via Gaussian ML: 


$$
\Delta^d y_t = \underset{(0.009)}{0.039} - \underset{(0.043)}{0.107}\, z_{t-1} + \varepsilon_t - \underset{(0.149)}{[\ldots]}\, \varepsilon_{t-1} \qquad R^2 = 0.0791
$$


The fractional differencing removes some of the negative persistence in the 
series, rendering the autoregressive coefficient insignificant. Finally, an 
LSTAR error correction model was also estimated via ML, assuming a 
skewed Student’s t distribution. The lagged cointegration errors were used as
transition variables, while only the error correction term parameter and 
intercept were allowed to switch: 


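$$
\begin{aligned}
\Delta y_t ={}& \underset{(0.023)}{0.033} - \underset{(0.077)}{0.165}\, z_{t-1} - \underset{(0.170)}{0.762}\, \Delta y_{t-1} \\
& + \left(\underset{(0.365)}{1.289} - \underset{(0.384)}{1.541}\, z_{t-1}\right)
\left(1 + \exp\left(-\underset{([\ldots])}{13.03}\left(z_{t-1} - \underset{(0.079)}{0.368}\right)\right)\right)^{-1} \\
& + \varepsilon_t - \underset{(0.079)}{0.916}\, \varepsilon_{t-1}
\end{aligned}
$$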


Although the LSTAR model appears to have a superior fit compared to the 
previous specifications, one must also consider that two additional param- 
eters are needed for the transition function. Moreover, it is interesting to 
observe that the speed of adjustment to equilibrium does not differ signifi- 
cantly between the two regimes. 
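
The shape of the estimated transition can be inspected with a short sketch. It assumes a standard logistic transition function and reads 13.03 as the slope and 0.368 as the location parameter of the LSTAR equation; both the functional form and these roles are assumptions made for illustration, and the function name is hypothetical.

```python
import math

def logistic_transition(z, gamma, c):
    """Logistic weight in (0, 1): near 0 for z well below c, near 1 well above."""
    return 1.0 / (1.0 + math.exp(-gamma * (z - c)))

# Values read from the estimated equation; their interpretation as slope and
# location parameters is an assumption.
GAMMA, C = 13.03, 0.368

# Transition weights at a few hypothetical values of the lagged cointegration error.
weights = {z: logistic_transition(z, GAMMA, C) for z in (0.0, 0.368, 0.8)}
```

With a slope of this size the weight moves from close to zero to close to one over a narrow range of the lagged error, so the fitted model behaves much like a two-regime threshold specification.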


10.2.2 Testing for cointegration with infinite variance errors and structural breaks 


In chapter 7 we presented considerable evidence that returns distributions 
were usually fat-tailed and may have infinite variance, and discussed briefly
the impact of such an assumption on unit root tests for individual series. 
Caner (1998) considers the impact of infinite variance errors on both the 
residual-based tests of cointegration proposed by Phillips and Ouliaris 
(1990) and the VECM reduced-rank likelihood tests associated with 
Johansen (1995). Assuming that the errors are stable, rather than normal, 
results in statistics that are functionals of stable processes rather than 
functionals of Brownian motion, and that depend on the value taken by the 
tail index as well as the number of variables in the system. Critical values are 
larger (in absolute value) than the conventional critical values, so the size 
distortions induced by incorrectly assuming normality will lead to over- 
rejection of the null of no cointegration. Caner shows that these size dis- 
tortions are only moderate for the residual-based tests, however, although 
they are more substantial for the trace and λ-max test statistics based on ML
estimation of the VECM representation. 

There have been several analyses looking at cointegration in the presence 
of structural shifts. Campos, Ericsson and Hendry (1996) consider the 
properties of several regression cointegration tests when one of the variables 
in the cointegrating relationship contains a structural break, finding that the 
break has little effect on the size of the tests but can affect their power when 
the data does not have a ‘common factor’, thus pointing towards the use of 
tests based on the ECM rather than on the cointegrating regression (recall 
chapter 9, section 3). Quintos and Phillips (1993) propose a recursive LM 
statistic to detect breaks in the cointegrating vector, while Quintos (1997) 
develops an extension of the fluctuation test, introduced in chapter 8, section 3.3, 
to examine both parameter and rank stability in VECMs. Other tests have been 
analysed by Gregory and Hansen (1996) and Gregory, Nason and Watt (1996), 
while Perron (2006) contains a recent survey of developments. 

Siklos and Granger (1997) look at the problem of structural breaks in 
cointegrating relationships from an alternative perspective: their argument is 
that an empirical finding of non-cointegration could be a consequence of 
external shocks or structural breaks that interrupt the underlying equilib- 
rium relationship, perhaps for an extended period of time. This leads them to 
define the concept of temporary cointegration, in which the cointegrating 
relationship can be switched on or off depending on the nature of the policy 
regime that is in place. 


Data appendix


The following series can be obtained from http://lboro.ac.uk/departments/ec/cup.


RS: 91-day Treasury bill rate, monthly, March 1952 to December 2005 (648 observations).
R20: Yield on 20-year UK gilts, monthly, March 1952 to December 2005 (648 observations).
RSQ: 91-day Treasury bill rate, quarterly, 1952Q1 to 2005Q4 (216 observations).
R20Q: Yield on 20-year UK gilts, quarterly, 1952Q1 to 2005Q4 (216 observations).
RSQREAL: Real 91-day Treasury bill rate, quarterly, 1952Q1 to 2005Q4 (216 observations).
FTAPRICE: FTA All Share price index, monthly, January 1965 to December 2005 (492 observations).
FTADIV: FTA All Share dividend index, monthly, January 1965 to December 2005 (492 observations).
FTARET: FTA All Share nominal returns, monthly, January 1965 to December 2005 (492 observations).
RPI: UK Retail Price Index, monthly, January 1965 to December 2005 (492 observations).
USTB: US 3-month Treasury bill rate, monthly, April 1953 to February 2005 (623 observations).
GIASE: Absolute returns on the General Index of the Athens Stock Exchange, daily, 1 June 1998 to 10 September 1998 (12,117 observations).
NORD: Nord Pool Exchange electricity prices, daily, 22 March 2002 to 3 December 2004 (988 observations).
VIX: VIX prices, daily, January 1990 to September 2005 (4130 observations).
EXCHD: Dollar/sterling exchange rate, daily, 1974 to 1994 (5192 observations).
EXCHQ: Dollar/sterling exchange rate, quarterly, 1972Q1 to 1996Q4 (100 observations).
S&P500: S&P 500 index, annual, 1871 to 2006 (136 observations).
S&P500R: S&P 500 real returns, annual, 1872 to 2006 (135 observations).
S&P500D: S&P 500 index, daily, 1928 to 1991 (17,054 observations).
FT30: FT 30 index, daily, 1935 to 1994 (15,003 observations).
FTSE100: FTSE 100 index, weekly, 1984 to 1993 (521 observations).
CTLD: Courtaulds share price, weekly, 1984 to 1993 (521 observations).
LGEN: Legal and General share price, weekly, 1984 to 1993 (521 observations).
PRU: Prudential share price, weekly, 1984 to 1993 (521 observations).